Pure Python HWP Parser

hwplib-py allows you to analyze HWP 5.0 files using pure Python. It parses the binary OLE2 structure directly, giving you access to text, tables, and control objects without needing the Hancom Office software.

Installation

pip install hwplib-py

📘 Code Examples (Cookbook)

Here are friendly, step-by-step examples for every major feature.

1. Loading & Metadata

Start here. Load a file and check its version and properties.

from hwplib.hwp5.api import load

# Load the file
doc = load("example.hwp")

# Check File Header
print(f"HWP Version: {doc.header.version_str}")  # e.g., "5.0.2.1"
print(f"Compressed:  {doc.header.is_compressed}")
print(f"Encrypted:   {doc.header.is_encrypted}")

# Check Document Information (Metadata)
print(f"Fonts used: {len(doc.doc_info.face_names)}")
for face in doc.doc_info.face_names:
    print(f" - Font Name: {face.name}")
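
For readers curious about where these header fields come from, here is a minimal sketch of decoding them by hand. It assumes the layout given in the published HWP 5.0 format description: a 256-byte FileHeader stream with a 32-byte signature, then a little-endian version DWORD and a property DWORD whose bit 0 means compressed and bit 1 means password-protected. `parse_file_header` is a hypothetical helper for illustration, not part of hwplib-py, and the library's own parsing may differ.

```python
import struct

SIGNATURE = b"HWP Document File"  # padded with NULs to 32 bytes in the stream


def parse_file_header(raw: bytes) -> dict:
    """Decode version and flag bits from a FileHeader stream (assumed layout)."""
    if len(raw) < 40 or not raw.startswith(SIGNATURE):
        raise ValueError("not an HWP 5.0 FileHeader")
    # Bytes 32..35 hold the version, bytes 36..39 the property flags.
    version, flags = struct.unpack_from("<II", raw, 32)
    parts = [(version >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    return {
        "version_str": ".".join(str(p) for p in parts),
        "is_compressed": bool(flags & 0x01),
        "is_encrypted": bool(flags & 0x02),
    }


# Synthetic 256-byte header: version 5.0.2.1, compressed, not encrypted.
sample = SIGNATURE.ljust(32, b"\x00") + struct.pack("<II", 0x05000201, 0x01) + bytes(216)
info = parse_file_header(sample)
print(info["version_str"], info["is_compressed"], info["is_encrypted"])
```

The version is packed one byte per component, which is why a four-part string such as "5.0.2.1" fits in a single 32-bit value.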

2. Text Extraction

Extract all text content from the document, including text inside tables and text boxes.

# Simple global extraction
full_text = doc.get_text()
print(full_text)

# Manual iteration (Section -> Paragraph)
for i, section in enumerate(doc.sections):
    print(f"--- Section {i} ---")
    for paragraph in section.paragraphs:
        # 'paragraph.text' contains the plain text of the paragraph
        print(paragraph.text)
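
If you want one string per section rather than printed lines, the manual loop can be wrapped in a small helper. This is a sketch that relies only on the `sections`, `paragraphs`, and `text` attributes shown above. `section_texts` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins so the snippet runs without an HWP file.

```python
from types import SimpleNamespace


def section_texts(doc):
    """Return one string per section, joining its non-empty paragraphs with newlines."""
    return [
        "\n".join(p.text for p in section.paragraphs if p.text.strip())
        for section in doc.sections
    ]


# Stand-in document with the same attribute names as the loop above.
demo_doc = SimpleNamespace(sections=[
    SimpleNamespace(paragraphs=[
        SimpleNamespace(text="Title"),
        SimpleNamespace(text="   "),
        SimpleNamespace(text="Body"),
    ]),
    SimpleNamespace(paragraphs=[SimpleNamespace(text="Appendix")]),
])
print(section_texts(demo_doc))
```

With a real document you would call `section_texts(doc)` on the object returned by `load`.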

3. Handling Tables

Tables are special Controls. You can iterate through rows and cells to get structured data.

from hwplib.hwp5.core.control import ControlTable

for section in doc.sections:
    for paragraph in section.paragraphs:
        for ctrl in paragraph.controls:
            
            # Check if this control is a Table
            if isinstance(ctrl, ControlTable):
                print(f"Found Table: {ctrl.row_count} Rows, {ctrl.col_count} Cols")
                
                # Iterate Rows
                for r_idx, row in enumerate(ctrl.rows):
                    print(f"  Row {r_idx}:")
                    
                    # Iterate Cells in the Row
                    for c_idx, cell in enumerate(row.cells):
                        # A Cell contains a list of Paragraphs!
                        cell_text = " ".join([p.text for p in cell.paragraphs])
                        print(f"    Cell[{c_idx}]: {cell_text}")
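
Once a table has been found, its cells can be flattened into rows of strings and written out as CSV. The sketch below uses only the `rows`, `cells`, `paragraphs`, and `text` attributes from the loop above. `table_to_rows` and `rows_to_csv` are hypothetical helpers, and the stand-in objects exist only so the example runs on its own.

```python
import csv
import io
from types import SimpleNamespace


def table_to_rows(table):
    """Flatten a table control into a list of rows, each a list of cell strings."""
    return [
        [" ".join(p.text for p in cell.paragraphs) for cell in row.cells]
        for row in table.rows
    ]


def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module handles quoting and commas."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def _cell(*texts):
    # Stand-in for a table cell holding one paragraph per text.
    return SimpleNamespace(paragraphs=[SimpleNamespace(text=t) for t in texts])


demo_table = SimpleNamespace(rows=[
    SimpleNamespace(cells=[_cell("Name"), _cell("Qty")]),
    SimpleNamespace(cells=[_cell("Pen", "blue"), _cell("3")]),
])
demo_csv = rows_to_csv(table_to_rows(demo_table))
print(demo_csv)
```

Merged cells are not handled here. A row that contains them may simply yield fewer entries than the column count.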

4. Equations (Math)

Extract the equation script (its syntax is similar to LaTeX) from equation objects.

from hwplib.hwp5.core.control import ControlEquation

for section in doc.sections:
    for paragraph in section.paragraphs:
        for ctrl in paragraph.controls:
            
            if isinstance(ctrl, ControlEquation):
                # The 'script' attribute holds the equation string
                print(f"Equation Script: {ctrl.script}")
                # Example Output: "y = ax^2 + bx + c"

5. Pictures & Images

Get information about embedded images.

from hwplib.hwp5.core.control import ControlPicture

for section in doc.sections:
    for paragraph in section.paragraphs:
        for ctrl in paragraph.controls:
            
            if isinstance(ctrl, ControlPicture):
                print("Image Found:")
                print(f"  Size: {ctrl.width} x {ctrl.height}")
                # 'bin_item_id' links to the actual binary data in the BinData stream
                print(f"  BinData ID: {ctrl.bin_item_id}")
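
The HWP format measures lengths in HWPUNIT, which is 1/7200 of an inch. Assuming `ctrl.width` and `ctrl.height` are reported in that unit (an assumption, not confirmed here), a conversion sketch looks like this. Both helper names are hypothetical.

```python
HWPUNIT_PER_INCH = 7200  # 1 HWPUNIT = 1/7200 inch
MM_PER_INCH = 25.4


def hwpunit_to_mm(value: int) -> float:
    """Convert a length in HWPUNIT to millimetres."""
    return value / HWPUNIT_PER_INCH * MM_PER_INCH


def hwpunit_to_px(value: int, dpi: int = 96) -> int:
    """Convert a length in HWPUNIT to whole pixels at the given resolution."""
    return round(value / HWPUNIT_PER_INCH * dpi)


print(hwpunit_to_mm(7200), hwpunit_to_px(7200))
```

For a picture control, `hwpunit_to_mm(ctrl.width)` would then give the width in millimetres.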

6. Shapes (Lines, Rects, Polygons)

Access vector drawing objects (GSO).

from hwplib.hwp5.core.control import ControlLine, ControlRect, ControlPolygon

for section in doc.sections:
    for paragraph in section.paragraphs:
        for ctrl in paragraph.controls:
            
            if isinstance(ctrl, ControlLine):
                print(f"Line from ({ctrl.start_x}, {ctrl.start_y}) to ({ctrl.end_x}, {ctrl.end_y})")
                
            elif isinstance(ctrl, ControlRect):
                print(f"Rectangle: {ctrl.width} x {ctrl.height}")
                
            elif isinstance(ctrl, ControlPolygon):
                print(f"Polygon with {len(ctrl.points)} vertices")
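
The examples in sections 3 to 6 repeat the same nested loops and only look at top-level controls. A generator can factor that out and also descend into table cells. This sketch assumes only the attribute names used above (`controls`, `rows`, `cells`, `paragraphs`). `iter_controls` is a hypothetical helper, not a library function.

```python
from types import SimpleNamespace


def iter_controls(paragraphs):
    """Yield every control under the given paragraphs, descending into table cells."""
    for paragraph in paragraphs:
        for ctrl in getattr(paragraph, "controls", []):
            yield ctrl
            # Tables expose rows -> cells -> paragraphs; other controls have no rows.
            for row in getattr(ctrl, "rows", []):
                for cell in row.cells:
                    yield from iter_controls(cell.paragraphs)


# Stand-in structure: a table whose single cell contains a line shape.
line = SimpleNamespace(name="line")
inner = SimpleNamespace(text="", controls=[line])
table = SimpleNamespace(
    name="table",
    rows=[SimpleNamespace(cells=[SimpleNamespace(paragraphs=[inner])])],
)
outer = SimpleNamespace(text="", controls=[table])

found = [c.name for c in iter_controls([outer])]
print(found)
```

With a real document, the loop becomes `for section in doc.sections: for ctrl in iter_controls(section.paragraphs): ...`, followed by the same `isinstance` checks as before.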

7. JSON Export

Convert the entire document structure to JSON for easy processing in other languages.

from hwplib.hwp5.core.exporter import HwpJsonExporter
import json

exporter = HwpJsonExporter()
data = exporter.export(doc)

# Save to file
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)
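
The `ensure_ascii=False` argument matters for HWP content, which is usually Korean. By default the `json` module escapes every non-ASCII character, so the output is valid but unreadable. A short demonstration using only the standard library:

```python
import json

record = {"text": "한글"}

escaped = json.dumps(record)                       # default: \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters as-is

print(escaped)
print(readable)

# Both forms decode back to the same data.
assert json.loads(escaped) == json.loads(readable) == record
```

When writing the readable form to disk, open the file with `encoding="utf-8"` as in the example above, so the result does not depend on the platform's default encoding.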

Object Model

The library maps the HWP binary structure to these Python objects:

HwpDocument
├── header (HwpFileHeader)
├── doc_info (DocInfo)
│   ├── face_names[] (Font Names)
│   ├── border_fills[]
│   └── styles[]
└── sections[] (List[Section])
    └── paragraphs[] (List[Paragraph])
        ├── text (String)
        └── controls[] (List[HwpControl])
            ├── ControlTable
            │   └── rows[] -> cells[] -> paragraphs[]
            ├── ControlPicture
            ├── ControlEquation
            └── ControlRect/Line/Ellipse...