Pure Python HWP Parser
hwplib-py allows you to analyze HWP 5.0 files using pure Python. It parses the binary OLE2 structure directly, giving you access to text, tables, and control objects without needing the Hancom Office software.
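Before parsing, it can help to confirm that a file really is an OLE2 (Compound File Binary) container. The sketch below checks the standard 8-byte OLE2 signature using only the standard library. The helper name `looks_like_ole2` is hypothetical and not part of hwplib-py.

```python
import os
import tempfile

# Signature that opens every OLE2 / Compound File Binary container.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole2(path):
    """Return True if the file starts with the OLE2 signature (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read(len(OLE2_MAGIC)) == OLE2_MAGIC


# Demo on a throwaway file so the sketch runs without a real document.
with tempfile.NamedTemporaryFile(suffix=".hwp", delete=False) as tmp:
    tmp.write(OLE2_MAGIC + bytes(504))
try:
    print(looks_like_ole2(tmp.name))
finally:
    os.remove(tmp.name)
```

A file that fails this check is not an HWP 5.0 binary document, so it is probably not worth passing to the loader.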
Installation
pip install hwplib-py
Code Examples (Cookbook)
Here are friendly, step-by-step examples for every major feature.
1. Loading & Metadata
Start here. Load a file and check its version and properties.
from hwplib.hwp5.api import load
# Load the file
doc = load("example.hwp")
# Check File Header
print(f"HWP Version: {doc.header.version_str}") # e.g., "5.0.2.1"
print(f"Compressed: {doc.header.is_compressed}")
print(f"Encrypted: {doc.header.is_encrypted}")
# Check Document Information (Metadata)
print(f"Fonts used: {len(doc.doc_info.face_names)}")
for face in doc.doc_info.face_names:
    print(f" - Font Name: {face.name}")
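To make the header fields less of a black box, here is a minimal sketch of how the version and the two flags could be decoded from the raw FileHeader bytes. It assumes the layout described in Hancom's public format documentation: a 32-byte signature, then a little-endian version DWORD, then a property DWORD whose two lowest bits mark compression and password protection. The function `parse_file_header` is hypothetical and is not how hwplib-py is necessarily implemented.

```python
import struct

SIGNATURE = b"HWP Document File"


def parse_file_header(raw):
    """Decode signature, version and two property flags (assumed layout)."""
    if len(raw) < 40:
        raise ValueError("FileHeader is too short")
    if raw[:32].rstrip(b"\x00") != SIGNATURE:
        raise ValueError("not an HWP 5.0 FileHeader")
    # Two little-endian 32-bit integers follow the 32-byte signature.
    version, flags = struct.unpack_from("<II", raw, 32)
    # The version packs four numbers, one per byte, most significant first.
    version_str = ".".join(str((version >> shift) & 0xFF) for shift in (24, 16, 8, 0))
    return {
        "version_str": version_str,
        "is_compressed": bool(flags & 0x01),  # assumed: bit 0
        "is_encrypted": bool(flags & 0x02),   # assumed: bit 1
    }


# Synthetic 256-byte header: version 5.0.2.1, compression flag set.
sample = SIGNATURE.ljust(32, b"\x00") + struct.pack("<II", 0x05000201, 0x01) + bytes(216)
print(parse_file_header(sample))
```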
2. Text Extraction
Extract all text content from the document, including text inside tables and text boxes.
# Simple global extraction
full_text = doc.get_text()
print(full_text)
# Manual iteration (Section -> Paragraph)
for i, section in enumerate(doc.sections):
    print(f"--- Section {i} ---")
    for paragraph in section.paragraphs:
        # 'paragraph.text' contains the plain text of the paragraph
        print(paragraph.text)
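The manual iteration above can be wrapped in a small generator that skips blank paragraphs and keeps track of the section index. This is a sketch only: `iter_paragraph_text` is a hypothetical helper, and it relies solely on the `sections`, `paragraphs`, and `text` attributes shown above. A stand-in document built from `SimpleNamespace` lets it run without an .hwp file.

```python
from types import SimpleNamespace


def iter_paragraph_text(doc, skip_empty=True):
    """Yield (section_index, text) for each paragraph in document order."""
    for s_idx, section in enumerate(doc.sections):
        for paragraph in section.paragraphs:
            text = paragraph.text
            if skip_empty and not text.strip():
                continue
            yield s_idx, text


# Stand-in for a loaded document (not a real hwplib-py object).
fake_doc = SimpleNamespace(sections=[
    SimpleNamespace(paragraphs=[SimpleNamespace(text="Title"), SimpleNamespace(text="  ")]),
    SimpleNamespace(paragraphs=[SimpleNamespace(text="Body text")]),
])

for s_idx, text in iter_paragraph_text(fake_doc):
    print(f"[section {s_idx}] {text}")
```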
3. Handling Tables
Tables are special Controls. You can iterate through rows and cells to get structured data.
from hwplib.hwp5.core.control import ControlTable
for section in doc.sections:
    for paragraph in section.paragraphs:
        for ctrl in paragraph.controls:
            # Check if this control is a Table
            if isinstance(ctrl, ControlTable):
                print(f"Found Table: {ctrl.row_count} Rows, {ctrl.col_count} Cols")
                # Iterate Rows
                for r_idx, row in enumerate(ctrl.rows):
                    print(f" Row {r_idx}:")
                    # Iterate Cells in the Row
                    for c_idx, cell in enumerate(row.cells):
                        # A Cell contains a list of Paragraphs!
                        cell_text = " ".join(p.text for p in cell.paragraphs)
                        print(f" Cell[{c_idx}]: {cell_text}")
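Once the cells are reachable, a table can be flattened into rows of strings and written out as CSV. The sketch below assumes only the `rows`, `cells`, `paragraphs`, and `text` attributes used above. `table_to_rows` and `rows_to_csv` are hypothetical helpers, and the stand-in table exists only so the example runs on its own.

```python
import csv
import io
from types import SimpleNamespace


def table_to_rows(table):
    """Flatten a table control into a list of rows of cell strings."""
    return [
        [" ".join(p.text for p in cell.paragraphs) for cell in row.cells]
        for row in table.rows
    ]


def rows_to_csv(rows):
    """Serialize rows to CSV text using the standard csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def _cell(*texts):
    # Build a stand-in cell holding one paragraph per text.
    return SimpleNamespace(paragraphs=[SimpleNamespace(text=t) for t in texts])


# Stand-in for a ControlTable (not a real hwplib-py object).
fake_table = SimpleNamespace(rows=[
    SimpleNamespace(cells=[_cell("Name"), _cell("Qty")]),
    SimpleNamespace(cells=[_cell("Paper", "A4"), _cell("10")]),
])

print(rows_to_csv(table_to_rows(fake_table)))
```

Merged cells are not handled here; how they appear in `rows` and `cells` would need to be checked against the library.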
4. Equations (Math)
Extract the equation script (syntax similar to LaTeX) from equation objects.
from hwplib.hwp5.core.control import ControlEquation
for section in doc.sections:
    for paragraph in section.paragraphs:
        for ctrl in paragraph.controls:
            if isinstance(ctrl, ControlEquation):
                # The 'script' attribute holds the equation string
                print(f"Equation Script: {ctrl.script}")
                # Example Output: "y = ax^2 + bx + c"
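The same Section -> Paragraph -> Control walk recurs in every example, so it may be worth factoring into one generator that filters by control type. This is a sketch under that assumption: `iter_controls` is a hypothetical helper, and the `FakeEquation` and `FakeTable` classes stand in for the library's control classes so the block runs without hwplib-py installed.

```python
from types import SimpleNamespace


def iter_controls(doc, control_type=object):
    """Yield every control in the document that is an instance of control_type."""
    for section in doc.sections:
        for paragraph in section.paragraphs:
            for ctrl in paragraph.controls:
                if isinstance(ctrl, control_type):
                    yield ctrl


class FakeEquation:
    """Stand-in for ControlEquation; only the 'script' attribute is modeled."""

    def __init__(self, script):
        self.script = script


class FakeTable:
    """Stand-in for any other control type."""


fake_doc = SimpleNamespace(sections=[SimpleNamespace(paragraphs=[
    SimpleNamespace(controls=[FakeEquation("y = ax^2 + bx + c"), FakeTable()]),
    SimpleNamespace(controls=[FakeEquation("E = mc^2")]),
])])

scripts = [eq.script for eq in iter_controls(fake_doc, FakeEquation)]
print(scripts)
```

With a real document, the call would presumably become `iter_controls(doc, ControlEquation)`.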
5. Pictures & Images
Get information about embedded images.
from hwplib.hwp5.core.control import ControlPicture
for section in doc.sections:
    for paragraph in section.paragraphs:
        for ctrl in paragraph.controls:
            if isinstance(ctrl, ControlPicture):
                print("Image Found:")
                print(f" Size: {ctrl.width} x {ctrl.height}")
                # 'bin_item_id' links to the actual binary data in the BinData stream
                print(f" BinData ID: {ctrl.bin_item_id}")
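The printed sizes are easier to read in physical units. HWP documents measure lengths in HWPUNIT, which is 1/7200 of an inch. The sketch below converts such values to millimetres. It assumes that `width` and `height` are reported in HWPUNIT, which should be verified against the library, and `hwpunit_to_mm` is a hypothetical helper.

```python
HWPUNIT_PER_INCH = 7200
MM_PER_INCH = 25.4


def hwpunit_to_mm(value):
    """Convert a length in HWPUNIT (1/7200 inch) to millimetres."""
    return value / HWPUNIT_PER_INCH * MM_PER_INCH


# Assumed example: a picture 14400 x 7200 HWPUNIT, i.e. 2 x 1 inches.
width_mm = hwpunit_to_mm(14400)
height_mm = hwpunit_to_mm(7200)
print(f"Size: {width_mm:.1f} mm x {height_mm:.1f} mm")
```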
6. Shapes (Lines, Rects, Polygons)
Access vector drawing objects (GSO).
from hwplib.hwp5.core.control import ControlLine, ControlRect, ControlPolygon
for section in doc.sections:
    for paragraph in section.paragraphs:
        for ctrl in paragraph.controls:
            if isinstance(ctrl, ControlLine):
                print(f"Line from ({ctrl.start_x}, {ctrl.start_y}) to ({ctrl.end_x}, {ctrl.end_y})")
            elif isinstance(ctrl, ControlRect):
                print(f"Rectangle: {ctrl.width} x {ctrl.height}")
            elif isinstance(ctrl, ControlPolygon):
                print(f"Polygon with {len(ctrl.points)} vertices")
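When a document contains many drawing objects, a quick inventory by class name can show which branches are worth handling. The sketch below tallies controls with `collections.Counter`. `count_controls_by_type` is a hypothetical helper, and the `Fake*` classes are stand-ins so the block runs without the library.

```python
from collections import Counter
from types import SimpleNamespace


def count_controls_by_type(doc):
    """Return a Counter mapping control class names to occurrence counts."""
    counts = Counter()
    for section in doc.sections:
        for paragraph in section.paragraphs:
            counts.update(type(ctrl).__name__ for ctrl in paragraph.controls)
    return counts


class FakeLine:
    """Stand-in for ControlLine."""


class FakeRect:
    """Stand-in for ControlRect."""


fake_doc = SimpleNamespace(sections=[SimpleNamespace(paragraphs=[
    SimpleNamespace(controls=[FakeLine(), FakeRect()]),
    SimpleNamespace(controls=[FakeLine()]),
    SimpleNamespace(controls=[]),
])])

for name, count in count_controls_by_type(fake_doc).most_common():
    print(f"{name}: {count}")
```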
7. JSON Export
Convert the entire document structure to JSON for easy processing in other languages.
from hwplib.hwp5.core.exporter import HwpJsonExporter
import json
exporter = HwpJsonExporter()
data = exporter.export(doc)
# Save to file
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)
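Passing `ensure_ascii=False` together with UTF-8 encoding matters for HWP content, which is often Korean: the text is written as readable characters rather than `\uXXXX` escapes. The sketch below shows the round trip with a small stand-in dictionary. The dictionary's shape is hypothetical and does not reflect the exporter's real output schema.

```python
import json
import os
import tempfile

# Stand-in for exported data; the real structure may differ.
sample = {"sections": [{"paragraphs": [{"text": "안녕하세요"}]}]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=2, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        raw = f.read()

restored = json.loads(raw)

print("안녕하세요" in raw)         # text is stored unescaped
print(restored == sample)          # the round trip preserves the data
print("\\u" in json.dumps(sample)) # the default setting escapes non-ASCII
```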
Object Model
The library maps the HWP binary structure to these Python objects:
HwpDocument
├── header (HwpFileHeader)
├── doc_info (DocInfo)
│   ├── face_names[] (Font Names)
│   ├── border_fills[]
│   └── styles[]
└── sections[] (List[Section])
    └── paragraphs[] (List[Paragraph])
        ├── text (String)
        └── controls[] (List[HwpControl])
            ├── ControlTable
            │   └── rows[] -> cells[] -> paragraphs[]
            ├── ControlPicture
            ├── ControlEquation
            └── ControlRect/Line/Ellipse...
Legal Notice
This product was developed by referring to Hancom Inc.'s public HWP (.hwp) file format documentation.
Copyright © 2026 CHoi Minseo. Licensed under Apache 2.0.