The pipeline, step by step

Every conversion — whether triggered manually, via API, or by a folder watch — follows the same path.

  1. Connect your cloud storage

    You authorize Vlkea Parse to access a specific folder in Google Drive, OneDrive, or Dropbox using OAuth. You choose which folder is watched and where converted files are written. We request the minimum scope required; the section on cloud storage access explains exactly what that scope covers.

  2. File downloaded to an in-memory buffer

    When a conversion runs, your file is downloaded to a RAM-backed temporary filesystem (tmpfs). It never touches disk storage. In production, this is enforced at the configuration level — the service refuses to start if the temporary directory is not on a memory filesystem.

  3. Content validated by binary inspection

    The file's binary signature (magic bytes) is checked against its declared type. Files that don't match are rejected before any conversion attempt. File extensions are not trusted.

  4. Conversion in an isolated process

    For DOCX, HTML, EPUB, RTF, ODT, and similar formats: conversion runs in an isolated subprocess. For PDFs: a dedicated GPU processing service is used — explained in detail below.

  5. Output sanitized

    The resulting Markdown is sanitized before any write-back, removing patterns that could cause issues downstream in your pipeline.

  6. Written back to your storage, buffer cleared

    The Markdown file is written to the output folder you specified in your cloud storage. The in-memory buffer is released. No document content is retained on our systems. A cleanup task runs every 10 minutes to remove any orphaned temporary files left by interrupted jobs.
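
Two of the safeguards above, the startup check from step 2 and the orphan cleanup from step 6, can be sketched in Python. This is a minimal illustration under stated assumptions, not the production implementation: the function names, the set of accepted filesystem types, and the use of `/proc/mounts` (Linux only) are all hypothetical choices made for the example.

```python
import os
import time
from typing import List, Optional

# Assumption: these filesystem types count as "memory-backed".
MEMORY_FILESYSTEMS = {"tmpfs", "ramfs"}


def filesystem_type(path: str, mounts_text: str) -> Optional[str]:
    """Return the filesystem type of the deepest mount point containing path.

    mounts_text uses the /proc/mounts layout: device, mount point, type, ...
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        inside = path == mount_point or path.startswith(prefix)
        # ">=" lets a later entry for the same mount point win.
        if inside and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def require_memory_tmpdir(path: str, mounts_text: Optional[str] = None) -> None:
    """Raise at startup if the temporary directory is not memory-backed."""
    if mounts_text is None:
        with open("/proc/mounts") as handle:
            mounts_text = handle.read()
    fs_type = filesystem_type(path, mounts_text)
    if fs_type not in MEMORY_FILESYSTEMS:
        raise RuntimeError(
            f"temporary directory {path!r} is on {fs_type!r}, "
            "not a memory filesystem"
        )


def remove_orphans(directory: str, max_age_seconds: float,
                   now: Optional[float] = None) -> List[str]:
    """Delete regular files older than max_age_seconds; return their names."""
    now = time.time() if now is None else now
    removed = []
    for entry in os.scandir(directory):
        if not entry.is_file(follow_symlinks=False):
            continue
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age_seconds:
            os.remove(entry.path)
            removed.append(entry.name)
    return removed
```

A service built this way would call `require_memory_tmpdir` once before accepting jobs and schedule `remove_orphans` on a timer; the age threshold used for "orphaned" is a design choice that the text above does not specify.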

What we store — and what we don't

We store operational metadata to power conversion history and enforce quotas. Document content is never stored.

We store

  • Filename, file size, file type
  • Conversion status (succeeded / failed)
  • Duration and page count
  • Timestamps
  • Your account email

We never store

  • Document content in the database, cache, or logs
  • Converted Markdown content
  • Content in error messages or monitoring reports
  • Content in the job queue (queue runs with no persistence)

PDF processing — full transparency

This section deserves special attention from security-conscious evaluators.

Why PDF conversion involves a GPU service

And exactly what that service sees.

PDFs often cannot be accurately converted using traditional text extraction. Many real-world PDFs, such as contracts, financial reports, and clinical documents, store text as positioned glyphs or scanned images rather than as structured, machine-readable text. The page is effectively an image. Accurate structural extraction requires a vision model that can read the page the same way a human would.

When you convert a PDF, here is exactly what happens:

  1. PDF pages are rendered to images in memory on our servers and are not saved to disk.
  2. The images are sent to a dedicated GPU processing service within our own infrastructure. This is not a third-party AI provider: not OpenAI, Google, Anthropic, or any commercial API. It is our own isolated service with no public URL, no database, and no content logging.
  3. The service receives page images only: not your original file, not your account identity, not any metadata.
  4. The service returns extracted Markdown. The images are discarded. Nothing persists after the request completes.
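
The data-minimization rule in step 3 can be illustrated with a short Python sketch. The class and field names here are hypothetical, invented for this example; the point is only that the payload type sent to the vision service has no field that could carry a filename, account identity, or other metadata.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ConversionJob:
    """Hypothetical internal job record; it never leaves the main service."""
    account_email: str
    filename: str
    page_images: List[bytes]  # pages rendered in memory


@dataclass(frozen=True)
class VisionRequest:
    """Hypothetical payload for the GPU service: page images and nothing else."""
    pages: List[bytes]


def build_vision_request(job: ConversionJob) -> VisionRequest:
    # Copy only the rendered pages; identity and file metadata stay behind.
    return VisionRequest(pages=list(job.page_images))
```

Keeping the outbound type this narrow means a reviewer can verify what crosses the service boundary by reading one class definition.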

For organizations that cannot accept any external processing

We support on-premise deployment. The full conversion pipeline — including the vision model for PDFs — runs within your own infrastructure. Documents are processed on your servers, on your hardware. No data leaves your network at any stage. Get in touch to discuss.

Cloud storage access — what we can and can't do

Vlkea Parse uses OAuth to connect to your cloud storage. Here is the precise scope of that access.

What we request

Read access to list and download files from the folder(s) you select. Write access to create output Markdown files in the folder you choose. We request the minimum scope required; where a provider offers no folder-level scope, the restriction to your chosen folder is enforced by our application code.

Scope and access — the honest picture

Our application code only reads from and writes to the folder(s) you select — we never query outside them. That said, the OAuth token itself grants broader access than just your chosen folder (this is a limitation of how Google Drive and OneDrive OAuth works — no folder-level scope exists). You can see exactly what access was granted and revoke it at any time from your Google, Microsoft, or Dropbox account security settings — independently of anything we say.
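
Because the restriction to the selected folder lives in application code rather than in the OAuth scope, it comes down to a guard that runs before every read or write. A minimal Python sketch follows, assuming path-style addressing for simplicity; the function name is hypothetical, and providers such as Google Drive identify files by ID, so a real check there would follow parent references instead of comparing paths.

```python
import posixpath


def is_inside_folder(candidate: str, selected_folder: str) -> bool:
    """Return True only if candidate resolves to a path within selected_folder.

    Both arguments are storage paths such as "/Contracts/2024/a.pdf".
    normpath collapses "." and ".." so traversal tricks are resolved
    before the comparison.
    """
    root = posixpath.normpath("/" + selected_folder.strip("/"))
    target = posixpath.normpath("/" + candidate.lstrip("/"))
    # Append a separator so "/Contracts" does not match "/ContractsOld".
    prefix = root.rstrip("/") + "/"
    return target == root or target.startswith(prefix)
```

For example, `is_inside_folder("/Contracts/../Private/a.pdf", "/Contracts")` returns `False`, because the path resolves to `/Private/a.pdf`.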

How tokens are stored

OAuth access and refresh tokens are encrypted at rest using envelope encryption — a unique encryption key is generated per token. Tokens are never stored in plain text.

How to revoke access

Disconnect directly from your Google, Microsoft, or Dropbox account security settings — or from within Vlkea Parse settings. Revocation is immediate. We cannot read or write to your storage after that point.

Security architecture

Technical facts for reviewers.

File validation

Content validated by magic bytes (binary signature inspection). File extensions are not trusted. Files that don't match their declared type are rejected before processing begins.
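
As an illustration of signature checking, here is a minimal Python sketch. The table and function name are assumptions made for this example, and the list of formats is deliberately short. The signatures themselves are the standard ones: PDF files begin with `%PDF-`, RTF files with `{\rtf`, and DOCX, EPUB, and ODT are ZIP containers that begin with `PK\x03\x04`. Formats without a fixed signature, such as HTML, need a different kind of inspection that is not shown here.

```python
# Leading bytes expected for each declared type (illustrative subset).
SIGNATURES = {
    "pdf": (b"%PDF-",),
    "rtf": (b"{\\rtf",),
    "docx": (b"PK\x03\x04",),  # ZIP container
    "epub": (b"PK\x03\x04",),  # ZIP container
    "odt": (b"PK\x03\x04",),   # ZIP container
}


def matches_declared_type(header: bytes, declared_type: str) -> bool:
    """Check the first bytes of a file against its declared type.

    Unknown types are rejected rather than passed through.
    """
    prefixes = SIGNATURES.get(declared_type.lower())
    if prefixes is None:
        return False
    return header.startswith(prefixes)
```

A ZIP signature alone cannot tell DOCX from EPUB or ODT; a stricter validator would also inspect the archive's internal manifest, which this sketch does not attempt.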

Process isolation

Each document is converted in a separate subprocess. A crash or failure in one conversion cannot affect others. PDF processing runs in a completely separate service.
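
One way to get this kind of isolation is to run each conversion as a child process with a timeout, passing the document over stdin and reading the result from stdout. The Python sketch below assumes a converter that works as a stdin-to-stdout command; the function name, the `ConversionFailed` exception, and the error codes are hypothetical.

```python
import subprocess
import sys
from typing import Sequence


class ConversionFailed(Exception):
    """Carries a generic error code, never document content."""


def convert_in_subprocess(command: Sequence[str], document: bytes,
                          timeout_seconds: float = 60.0) -> bytes:
    """Run one conversion in its own process and return its stdout.

    A crash, non-zero exit, or hang in the child surfaces here as
    ConversionFailed and cannot take down the parent or other jobs.
    """
    try:
        result = subprocess.run(
            command,
            input=document,
            capture_output=True,
            timeout=timeout_seconds,
            check=False,
        )
    except subprocess.TimeoutExpired:
        raise ConversionFailed("E_TIMEOUT") from None
    if result.returncode != 0:
        # stderr is deliberately not copied into the exception.
        raise ConversionFailed("E_CONVERTER_EXIT")
    return result.stdout


if __name__ == "__main__":
    # Stand-in "converter" that upper-cases its input.
    demo = [sys.executable, "-c",
            "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())"]
    print(convert_in_subprocess(demo, b"hello"))
```

A subprocess gives crash isolation; stronger sandboxing (resource limits, restricted filesystem or network access) would be layered on top and is outside the scope of this sketch.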

No content in errors

Error messages use generic codes only. No file content appears in application logs, error responses, or monitoring reports. Error tracking is configured with PII disabled.
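
The generic-code approach can be sketched in Python as follows. The code strings, the exception mapping, and the logger name are hypothetical; what the example shows is the pattern of logging and returning only a code and an exception class name, never the exception message, which might quote document text.

```python
import logging

logger = logging.getLogger("converter")

# Hypothetical mapping from exception type to a generic public code.
ERROR_CODES = {
    ValueError: "E_INVALID_INPUT",
    TimeoutError: "E_TIMEOUT",
}


def public_error(exc: Exception) -> dict:
    """Translate an exception into a content-free error response."""
    code = "E_INTERNAL"
    for exc_type, mapped in ERROR_CODES.items():
        if isinstance(exc, exc_type):
            code = mapped
            break
    # Log the code and the exception class only; str(exc) is never used.
    logger.error("conversion failed code=%s type=%s", code, type(exc).__name__)
    return {"error": code}
```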

Output sanitization

Converted Markdown is sanitized before being written to your cloud storage. Malicious patterns introduced by document content cannot propagate to output files.
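
The exact patterns removed are not listed here, so the following Python sketch picks three plausible ones as assumptions: embedded script elements, link targets using the `javascript:`, `data:`, or `vbscript:` schemes, and non-printing control characters. The function name is hypothetical, and a regex pass like this is an illustration of the idea rather than a complete sanitizer.

```python
import re

# Assumed patterns; the real rule set is not specified in this document.
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                          re.IGNORECASE | re.DOTALL)
UNSAFE_LINK = re.compile(r"\]\(\s*(?:javascript|data|vbscript):[^)]*\)",
                         re.IGNORECASE)
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def sanitize_markdown(text: str) -> str:
    """Remove a few patterns that could be harmful when rendered downstream."""
    text = SCRIPT_BLOCK.sub("", text)      # drop inline script elements
    text = UNSAFE_LINK.sub("](#)", text)   # neutralize unsafe link targets
    return CONTROL_CHARS.sub("", text)     # keep tab, newline, carriage return
```

Ordinary links such as `[docs](https://example.com)` pass through unchanged.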

CSRF protection

All state-changing requests require a CSRF token. The only exemptions are Bearer-authenticated API calls and health check endpoints.
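
A minimal Python sketch of such a check is shown below. The header name `X-CSRF-Token`, the `/healthz` path, and the function names are assumptions made for the example. Bearer-authenticated calls are exempt because the browser does not attach an `Authorization` header automatically, so a cross-site request cannot forge one.

```python
import hmac
import secrets

SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}   # not state-changing
EXEMPT_PATHS = {"/healthz"}                  # hypothetical health check path


def new_csrf_token() -> str:
    """Generate an unguessable token to store in the user's session."""
    return secrets.token_urlsafe(32)


def csrf_check_passes(method: str, path: str, headers: dict,
                      session_token: str) -> bool:
    """Decide whether a request may proceed under the CSRF policy."""
    if method.upper() in SAFE_METHODS or path in EXEMPT_PATHS:
        return True
    if headers.get("Authorization", "").startswith("Bearer "):
        return True
    submitted = headers.get("X-CSRF-Token", "")
    if not session_token or not submitted:
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(submitted.encode(), session_token.encode())
```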

Authentication

Dashboard sessions use short-lived JWTs (15-minute TTL) with automatic renewal. The REST API authenticates with API keys, which are stored as bcrypt hashes. MCP integrations use OAuth 2.1 with PKCE.
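
For reviewers unfamiliar with PKCE, the client-side half of the exchange is small enough to show in full. The Python sketch below follows the S256 method from RFC 7636: the client generates a random code verifier, sends its SHA-256 hash (base64url-encoded without padding) as the code challenge, and later proves possession by presenting the verifier. The function names are chosen for this example.

```python
import base64
import hashlib
import secrets


def new_code_verifier() -> str:
    """Random verifier; 32 bytes encode to 43 URL-safe characters."""
    return secrets.token_urlsafe(32)


def code_challenge_s256(verifier: str) -> str:
    """BASE64URL(SHA256(verifier)) with the trailing '=' padding removed."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The authorization server stores the challenge with the authorization code and, at token exchange, recomputes it from the presented verifier; an intercepted authorization code is useless without the verifier.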

Privacy — plain language

No legal boilerplate. Here is what we actually do with your data.

What we use your data for

  • To show you conversion history in the dashboard
  • To enforce conversion quotas (rate limits per plan)
  • To process billing — Stripe handles payment details; we store only your subscription status

What we don't do

  • We don't sell your data or share it with advertisers
  • We don't use your documents to train AI models — we don't have your documents
  • We don't retain document content, anywhere, ever

Your rights

  • Export: Download a full copy of your stored data at any time from Settings → Account
  • Delete: Account deletion removes all your data from our systems immediately and permanently
  • Automatic retention: Conversion metadata is automatically deleted after 90 days — no manual action required

On-premise deployment

If your organization's policy requires that documents never leave your own infrastructure — whether for regulatory compliance, data sovereignty, or internal security requirements — we support full on-premise deployment.

  • The complete conversion pipeline runs within your infrastructure
  • The vision model for PDF processing runs on your hardware
  • No data leaves your network at any stage of processing
  • You maintain full control over retention, access, and auditing

Get in touch about on-premise

Questions? [email protected]