
SEC Filings Document Splitting API

An SEC filing is not a single file; it is a large, monolithic SGML container. Across our database of over 18 million filings, these containers hold the primary HTML reports, machine-readable XML/XBRL data, embedded graphics, and legal exhibits, all fused into one text stream. The Document Splitting API dissects these containers and provides a structured manifest of every individual file, so you can skip the clutter and fetch discrete assets directly, whether you need the clean primary HTML of a 10-K, the raw calculation linkbases, or base64-decoded corporate logos.

Our Architecture: Pre-Split & Zero Latency

Extracting discrete files directly from SEC EDGAR requires downloading large, uncompressed text archives and manually parsing legacy SGML document boundaries. Our backend engine splits, cleans, type-casts, and indexes every primary document, XML file, and graphic inside the container the moment it is published. Because the splitting is done ahead of time, manifest and file retrieval is sub-second across our entire historical database.

1. Fetch Document Manifest

Pass the Accession Number to retrieve the pre-split manifest. The response includes the exact filenames, sequence numbers, extensions, and standardized document types for every file parsed from the filing container.

GET https://data.businessquant.com/secfilings/{accession}/documents?api_key={api_key}

Request Parameters

Parameter | Description
accession (Required) | The unique Accession Number of the SEC filing container. Format: 0000320193-23-000106
api_key (Required) | Your API key for authentication and billing. Format: api_key=YOUR_API_KEY

Request Example (JSON Manifest)

curl -X GET "https://data.businessquant.com/secfilings/0000320193-23-000106/documents?api_key=YOUR_API_KEY"

Response Artifact (HATEOAS Linked)

{
    "metadata": {
        "cik": 320193,
        "ticker": "AAPL",
        "accession": "0000320193-23-000106",
        "total_records": 3
    },
    "data": [
        {
            "sequence": 1,
            "filename": "aapl-20230930.htm",
            "type": "10-K",
            "description": "FORM 10-K",
            "extension": "htm",
            "content_url": "https://data.businessquant.com/secfilings/0000320193-23-000106/content?filename=aapl-20230930.htm&api_key=YOUR_API_KEY"
        },
        {
            "sequence": 2,
            "filename": "ex21-1.htm",
            "type": "EX-21",
            "description": "SUBSIDIARIES OF THE REGISTRANT",
            "extension": "htm",
            "content_url": "https://data.businessquant.com/secfilings/0000320193-23-000106/content?filename=ex21-1.htm&api_key=YOUR_API_KEY"
        },
        {
            "sequence": 3,
            "filename": "aapl-20230930_cal.xml",
            "type": "EX-101.CAL",
            "description": "XBRL TAXONOMY EXTENSION CALCULATION LINKBASE DOCUMENT",
            "extension": "xml",
            "content_url": "https://data.businessquant.com/secfilings/0000320193-23-000106/content?filename=aapl-20230930_cal.xml&api_key=YOUR_API_KEY"
        }
    ]
}
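
The manifest request above can be sketched in Python as follows. This is a minimal illustration, not an official client: the helper names (`manifest_url`, `fetch_manifest`, `filenames_by_type`) are hypothetical, and the accession pattern is an assumption inferred from the documented format 0000320193-23-000106.

```python
import json
import re
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://data.businessquant.com/secfilings"

# Assumption: accession numbers follow the 10-2-6 digit layout shown in the docs.
ACCESSION_RE = re.compile(r"^\d{10}-\d{2}-\d{6}$")


def manifest_url(accession: str, api_key: str) -> str:
    """Build the manifest endpoint URL for one filing container."""
    if not ACCESSION_RE.match(accession):
        raise ValueError(f"unexpected accession format: {accession!r}")
    return f"{BASE_URL}/{accession}/documents?api_key={quote(api_key, safe='')}"


def fetch_manifest(accession: str, api_key: str, timeout: float = 30.0) -> dict:
    """Download and decode the JSON manifest (performs a network call)."""
    with urlopen(manifest_url(accession, api_key), timeout=timeout) as resp:
        return json.load(resp)


def filenames_by_type(manifest: dict) -> dict:
    """Group manifest filenames by their standardized document type."""
    grouped: dict = {}
    for entry in manifest["data"]:
        grouped.setdefault(entry["type"], []).append(entry["filename"])
    return grouped
```

With a real key, `filenames_by_type(fetch_manifest("0000320193-23-000106", "YOUR_API_KEY"))` would index the manifest by document type, which makes it easy to look up a specific exhibit before calling the Content endpoint.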

2. Fetch Specific Document Content

Once you identify the exact file you need from the manifest, query the Content endpoint using the filename to retrieve the raw HTML, XML, or binary Graphic data.

GET https://data.businessquant.com/secfilings/{accession}/content?filename={filename}&api_key={api_key}

Request Parameters

Parameter | Description
accession (Required) | The unique Accession Number of the SEC filing. Format: 0000320193-23-000106
filename (Required) | The exact filename mapped directly from the manifest response. Format: filename=aapl-20230930.htm
api_key (Required) | Your API key for authentication and billing. Format: api_key=YOUR_API_KEY

Request Example (Fetch HTML/XML)

curl -X GET "https://data.businessquant.com/secfilings/0000320193-23-000106/content?filename=aapl-20230930.htm&api_key=YOUR_API_KEY"

Response Artifact (Raw File Bytes)

Note: Unlike the JSON endpoints, this endpoint returns the raw file with a matching Content-Type header, such as text/html, application/xml, image/jpeg, or image/gif.

<html>
    <head>
        <title>Apple Inc. - Form 10-K</title>
    </head>
    <body>
        <p style="font-weight:bold;text-align:center;">UNITED STATES</p>
        <p style="font-weight:bold;text-align:center;">SECURITIES AND EXCHANGE COMMISSION</p>
        <p style="text-align:center;">Washington, D.C. 20549</p>
        <p style="font-weight:bold;text-align:center;">FORM 10-K</p>
        <!-- ... (Full body of the primary filing) ... -->
    </body>
</html>
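
A content request like the one above might look like this in Python. It is a sketch under stated assumptions: `content_url` and `fetch_content` are hypothetical helper names, and the only endpoint details used are the URL pattern and query parameters documented in this section.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://data.businessquant.com/secfilings"


def content_url(accession: str, filename: str, api_key: str) -> str:
    """Build the Content endpoint URL, percent-encoding the query values."""
    query = urlencode({"filename": filename, "api_key": api_key})
    return f"{BASE_URL}/{accession}/content?{query}"


def fetch_content(accession: str, filename: str, api_key: str,
                  timeout: float = 30.0) -> tuple:
    """Return (media_type, raw_bytes) for one document (performs a network call).

    The body is returned as bytes because the endpoint may serve HTML, XML,
    or image data; the caller decides whether to decode it.
    """
    with urlopen(content_url(accession, filename, api_key), timeout=timeout) as resp:
        return resp.headers.get_content_type(), resp.read()
```

Returning the media type alongside the bytes lets the caller branch on it, for example decoding text/html to a string while writing image/jpeg straight to disk.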
            

3. Supported Document Classes

The Document Splitting API systematically breaks down the entire SGML container. When scanning the document manifest, you will encounter the following broad categories of extracted files. Understanding these classes is crucial for building robust NLP and quantitative data pipelines.

The Primary HTML Filing & Base Media

The most valuable asset in the container is the primary HTML document. Filings on SEC EDGAR often embed inline XBRL elements (such as <ix:nonFraction>) directly in the HTML, which can trip up standard web scrapers. Our engine strips out these inline XBRL tags, namespaces, and SGML headers to return the clean HTML of the actual report.

Document Class / Type | Typical Extension | Standard Description
Primary HTML Filing | .htm / .html | The main body of the SEC filing (e.g., the actual 10-K, 10-Q, or 8-K text). Our engine has cleaned the raw SGML and stripped out inline XBRL data to return pure HTML. This is typically assigned Sequence 1 in the container.
GRAPHIC | .jpg / .gif | Embedded images, charts, and logos. The engine natively decodes the base64/uuencoded streams in the SGML and returns raw binary image files.
XML | .xml | General XML data files. This includes structured data payloads like the raw matrices found in Form 4 Insider Transactions or Form 13F Institutional Holdings.
CORRESP | .htm / .txt | Correspondence. Uploads of correspondence between the SEC and the company regarding filing reviews or administrative matters.
INFORMATION TABLE | .xml | Specific to 13F filings, this contains the actual XML table of the institution's equity holdings.
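
Given a manifest shaped like the example response in section 1, the primary document and any embedded graphics could be picked out as sketched below. The function names are hypothetical, and selecting the lowest sequence number is a heuristic based on the note that the primary filing is typically Sequence 1.

```python
def primary_document(manifest: dict) -> dict:
    """Return the manifest entry with the lowest sequence number.

    Heuristic: the primary filing is typically, but not necessarily,
    assigned Sequence 1 in the container.
    """
    return min(manifest["data"], key=lambda entry: entry["sequence"])


def graphic_filenames(manifest: dict) -> list:
    """List filenames whose document type is GRAPHIC."""
    return [
        entry["filename"]
        for entry in manifest["data"]
        if entry["type"].upper() == "GRAPHIC"
    ]
```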

XBRL / Machine-Readable Data Linkbases

If you are building custom financial data models, you can bypass the HTML entirely and fetch the raw XML linkbases. These files dictate the exact mathematical and structural relationships of every financial data point the company reported.

Document Class / Type | Typical Extension | Standard Description
EX-101.INS | .xml | XBRL Instance Document. The core machine-readable file containing the raw financial data tags, values, and contextual dates.
EX-101.SCH | .xsd | XBRL Taxonomy Extension Schema. The company-specific taxonomy schema validating the data against the standard US-GAAP taxonomy.
EX-101.CAL | .xml | XBRL Calculation Linkbase. Defines the mathematical relationships between XBRL tags (e.g., ensuring that Assets = Liabilities + Equity).
EX-101.DEF | .xml | XBRL Definition Linkbase. Defines dimensional structures and table architectures in the data (e.g., breaking down revenue by geographical segments).
EX-101.LAB | .xml | XBRL Label Linkbase. Provides the human-readable English labels for the underlying XBRL tags so they can be rendered correctly in financial statements.
EX-101.PRE | .xml | XBRL Presentation Linkbase. Dictates how the XBRL tags should be ordered, indented, and nested when rendered visually on a balance sheet.
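
To illustrate what a calculation linkbase encodes, the sketch below reads the summation arcs from a small hand-written XML sample using the standard XBRL linkbase and XLink namespaces. The sample is simplified and is not taken from a real filing: in actual linkbases the `xlink:from` and `xlink:to` values are locator labels that must be resolved through `link:loc` elements, a step omitted here.

```python
import xml.etree.ElementTree as ET

LINK_NS = "http://www.xbrl.org/2003/linkbase"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Simplified, hand-written sample (not from a real filing).
SAMPLE_CAL = """<link:linkbase xmlns:link="http://www.xbrl.org/2003/linkbase"
                xmlns:xlink="http://www.w3.org/1999/xlink">
  <link:calculationLink xlink:type="extended">
    <link:calculationArc xlink:type="arc"
        xlink:from="LiabilitiesAndStockholdersEquity"
        xlink:to="Liabilities" weight="1.0" order="1"/>
    <link:calculationArc xlink:type="arc"
        xlink:from="LiabilitiesAndStockholdersEquity"
        xlink:to="StockholdersEquity" weight="1.0" order="2"/>
  </link:calculationLink>
</link:linkbase>"""


def calculation_arcs(xml_data) -> list:
    """Return (parent, child, weight) for every calculationArc element.

    Accepts str or bytes, so raw bytes from the Content endpoint can be
    passed in directly.
    """
    root = ET.fromstring(xml_data)
    arcs = []
    for arc in root.iter(f"{{{LINK_NS}}}calculationArc"):
        parent = arc.get(f"{{{XLINK_NS}}}from")
        child = arc.get(f"{{{XLINK_NS}}}to")
        weight = float(arc.get("weight", "1"))
        arcs.append((parent, child, weight))
    return arcs
```

Each tuple states that the child contributes to the parent total with the given weight; a weight of -1.0 would indicate a subtracted item.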

Common Legal Exhibits (Regulation S-K)

Exhibits are regulated under Item 601 of Regulation S-K and are high-value alternative data targets. For example, quantitative researchers frequently mine EX-10 for supply chain agreements, or EX-21 to map a conglomerate's offshore subsidiaries.

Document Class / Type | Typical Extension | Standard Description
EX-1 | .htm / .txt | Underwriting Agreements related to the public offering of securities.
EX-2 | .htm / .txt | Plans of acquisition, reorganization, arrangement, liquidation, or succession.
EX-3 | .htm / .txt | Articles of incorporation and corporate bylaws.
EX-4 | .htm / .txt | Instruments defining the rights of security holders, including indentures.
EX-5 | .htm / .txt | Legal opinion regarding the legality and validity of the securities being registered.
EX-10 | .htm / .txt | Material Contracts. Highly valuable for research, including major supply agreements, real estate leases, executive compensation contracts, and credit facilities.
EX-14 | .htm / .txt | The company’s official code of ethics or conduct.
EX-16 | .htm / .txt | Letter from a former accountant regarding their termination or resignation.
EX-21 | .htm / .txt | Subsidiaries of the Registrant. A complete hierarchical list of the company’s domestic and foreign subsidiaries.
EX-23 | .htm / .txt | Consents from auditors or legal counsel to use their opinions in the filing.
EX-31 / EX-31.1 | .htm / .txt | Sarbanes-Oxley Act Section 302 certifications by the CEO/CFO regarding the accuracy of the report.
EX-32 / EX-32.1 | .htm / .txt | Sarbanes-Oxley Act Section 906 certifications by the CEO/CFO.
EX-99 / EX-99.1 | .htm / .txt | Additional Exhibits. A catch-all that frequently contains heavily formatted Press Releases or Investor Presentations attached to 8-K filings.
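
Because exhibit types can carry a numeric or lettered suffix (EX-31.1, EX-99.1, EX-101.CAL), it can help to group manifest entries by their base exhibit number. The sketch below shows one way to do that; the function names and the regular expression are illustrative assumptions, not part of the API.

```python
import re
from collections import defaultdict
from typing import Optional

# Matches "EX-<number>" with an optional ".suffix", e.g. EX-10, EX-31.1, EX-101.CAL.
EXHIBIT_RE = re.compile(r"^(EX-\d+)(?:\.\S+)?$")


def exhibit_family(doc_type: str) -> Optional[str]:
    """Return the base exhibit type (e.g. 'EX-31' for 'EX-31.1'), else None."""
    match = EXHIBIT_RE.match(doc_type.strip().upper())
    return match.group(1) if match else None


def group_exhibits(entries: list) -> dict:
    """Map each base exhibit type to the filenames that belong to it."""
    grouped = defaultdict(list)
    for entry in entries:
        family = exhibit_family(entry["type"])
        if family is not None:  # skip non-exhibit types such as 10-K or GRAPHIC
            grouped[family].append(entry["filename"])
    return dict(grouped)
```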

4. Frequently Asked Questions & Edge Cases

Legacy Monolithic Filings

Filings made before HTML became standard on EDGAR (typically pre-2003) were often submitted as a single, monolithic .txt file. For these legacy filings, the Document Splitting manifest may return only a single document (Sequence 1) containing the entire text, because the original filer did not attach discrete exhibits or XML files.

Amended Filings (10-K/A, 8-K/A)

When a company amends a filing, the amendment is filed under a new Accession Number. The manifest for an amended filing contains only the documents that were updated or added; it does not duplicate the exhibits from the original filing. If you need the complete set of exhibits, you must fetch the manifests for both the original Accession Number and the amended Accession Number.
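
Once both manifests are fetched, they need to be combined. The sketch below shows one possible merge policy, which is an assumption rather than documented API behavior: documents in the amendment replace original documents of the same type, and every other original document is kept. Each merged entry is tagged with the accession it came from so it can still be fetched from the right container.

```python
def merge_manifests(original: dict, amended: dict) -> list:
    """Combine an original and an amended manifest into one document list.

    Assumed policy: an amended document supersedes all original documents
    of the same type; originals with other types are carried over unchanged.
    """
    amended_types = {entry["type"] for entry in amended["data"]}
    merged = []
    for entry in original["data"]:
        if entry["type"] not in amended_types:
            merged.append({**entry, "accession": original["metadata"]["accession"]})
    for entry in amended["data"]:
        merged.append({**entry, "accession": amended["metadata"]["accession"]})
    return merged
```

Whether type-level replacement is appropriate depends on the filing; for exhibits that can appear several times (such as EX-10), matching on filename or description may be a better fit.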

Handling Media and Graphics

Graphics are handled differently from text documents. If you pass the filename of a GRAPHIC document (such as logo.jpg) to the Content endpoint, the API returns the raw binary image data with the appropriate image/jpeg or image/gif Content-Type header. Make sure your HTTP client handles binary responses when requesting these files.
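
A common way to handle both cases is to treat every response body as bytes and decode it only when the Content-Type indicates text. The sketch below illustrates this; the helper names and the set of text media types are assumptions chosen for the example.

```python
from typing import Union

# Assumption: media types that are safe to decode to str.
TEXT_MEDIA_TYPES = {"text/html", "text/plain", "text/xml", "application/xml"}


def media_type(content_type_header: str) -> str:
    """Strip parameters such as '; charset=utf-8' from a Content-Type value."""
    return content_type_header.split(";", 1)[0].strip().lower()


def decode_body(body: bytes, content_type_header: str,
                encoding: str = "utf-8") -> Union[str, bytes]:
    """Decode text responses to str; return image data as untouched bytes."""
    if media_type(content_type_header) in TEXT_MEDIA_TYPES:
        return body.decode(encoding, errors="replace")
    return body


def save_body(body: bytes, path: str) -> int:
    """Write the raw bytes to disk in binary mode; returns bytes written."""
    with open(path, "wb") as handle:
        return handle.write(body)
```

Writing in binary mode ("wb") matters for graphics: opening the file in text mode or decoding the body first would corrupt the image.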