SEC Filings Document Splitting API
SEC filings are not single files; each one is a large, monolithic SGML container. Across our database of over 18 million filings, these containers hold the primary HTML reports, machine-readable XML/XBRL data, embedded graphics, and legal exhibits, all fused into one text stream. The Document Splitting API splits these containers apart and provides a structured manifest of every individual file, so you can skip the clutter and fetch exactly the asset you need, whether that is the clean primary HTML of a 10-K, the raw calculation linkbases, or decoded corporate logos.
Our Architecture: Pre-Split & Zero Latency
Extracting discrete files directly from SEC EDGAR requires downloading large, uncompressed text archives and manually parsing legacy SGML boundaries. Our backend engine splits, cleans, type-casts, and indexes every primary document, XML file, and graphic inside the container as soon as the filing is published. Because this work is done ahead of time, manifest and file retrieval is sub-second across our entire historical database.
1. Fetch Document Manifest
Pass the Accession Number to retrieve the pre-split manifest. The response includes the exact filenames, sequence numbers, extensions, and standardized document types for every file parsed from the filing container.
https://data.businessquant.com/secfilings/{accession}/documents?api_key={api_key}
Request Parameters
| Parameter | Description |
|---|---|
| accession | Required. The unique Accession Number of the SEC filing container. Format: 0000320193-23-000106 |
| api_key | Required. Your API key for authentication and billing. Format: api_key=YOUR_API_KEY |
curl -X GET "https://data.businessquant.com/secfilings/0000320193-23-000106/documents?api_key=YOUR_API_KEY"
{
  "metadata": {
    "cik": 320193,
    "ticker": "AAPL",
    "accession": "0000320193-23-000106",
    "total_records": 3
  },
  "data": [
    {
      "sequence": 1,
      "filename": "aapl-20230930.htm",
      "type": "10-K",
      "description": "FORM 10-K",
      "extension": "htm",
      "content_url": "https://data.businessquant.com/secfilings/0000320193-23-000106/content?filename=aapl-20230930.htm&api_key=YOUR_API_KEY"
    },
    {
      "sequence": 2,
      "filename": "ex21-1.htm",
      "type": "EX-21",
      "description": "SUBSIDIARIES OF THE REGISTRANT",
      "extension": "htm",
      "content_url": "https://data.businessquant.com/secfilings/0000320193-23-000106/content?filename=ex21-1.htm&api_key=YOUR_API_KEY"
    },
    {
      "sequence": 3,
      "filename": "aapl-20230930_cal.xml",
      "type": "EX-101.CAL",
      "description": "XBRL TAXONOMY EXTENSION CALCULATION LINKBASE DOCUMENT",
      "extension": "xml",
      "content_url": "https://data.businessquant.com/secfilings/0000320193-23-000106/content?filename=aapl-20230930_cal.xml&api_key=YOUR_API_KEY"
    }
  ]
}
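The manifest request above can also be scripted. The following is a minimal Python sketch, not an official client: the helper names (`manifest_url`, `fetch_manifest`, `documents_of_type`) are hypothetical, and the accession check assumes the 10-2-6 digit layout shown in the Format column. Only the URL pattern and the `data`/`type` response fields come from the documentation above.

```python
import json
import re
import urllib.parse
import urllib.request

BASE_URL = "https://data.businessquant.com/secfilings"
# Assumed layout: 10 digits, 2 digits, 6 digits (e.g. 0000320193-23-000106).
ACCESSION_RE = re.compile(r"\d{10}-\d{2}-\d{6}")


def manifest_url(accession: str, api_key: str) -> str:
    """Build the manifest URL, rejecting malformed accession numbers early."""
    if not ACCESSION_RE.fullmatch(accession):
        raise ValueError(f"unexpected accession format: {accession!r}")
    query = urllib.parse.urlencode({"api_key": api_key})
    return f"{BASE_URL}/{accession}/documents?{query}"


def fetch_manifest(accession: str, api_key: str, timeout: float = 30.0) -> dict:
    """Download and parse the manifest JSON (performs a network request)."""
    url = manifest_url(accession, api_key)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def documents_of_type(manifest: dict, doc_type: str) -> list:
    """Return the manifest entries whose "type" field equals doc_type."""
    return [doc for doc in manifest.get("data", []) if doc.get("type") == doc_type]
```

For example, `documents_of_type(fetch_manifest("0000320193-23-000106", "YOUR_API_KEY"), "EX-21")` would select the subsidiaries exhibit from the sample response, and each returned entry carries the `content_url` needed for the next step.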
2. Fetch Specific Document Content
Once you have identified the file you need in the manifest, query the Content endpoint with its filename to retrieve the raw HTML, XML, or binary graphic data.
https://data.businessquant.com/secfilings/{accession}/content?filename={filename}&api_key={api_key}
Request Parameters
| Parameter | Description |
|---|---|
| accession | Required. The unique Accession Number of the SEC filing. Format: 0000320193-23-000106 |
| filename | Required. The exact filename mapped directly from the manifest response. Format: filename=aapl-20230930.htm |
| api_key | Required. Your API key for authentication and billing. Format: api_key=YOUR_API_KEY |
curl -X GET "https://data.businessquant.com/secfilings/0000320193-23-000106/content?filename=aapl-20230930.htm&api_key=YOUR_API_KEY"
The Content endpoint returns the raw file with a text/html, application/xml, or image/jpeg content-type header, unlike the rest of the API, which returns JSON.
<html>
<head>
<title>Apple Inc. - Form 10-K</title>
</head>
<body>
<p style="font-weight:bold;text-align:center;">UNITED STATES</p>
<p style="font-weight:bold;text-align:center;">SECURITIES AND EXCHANGE COMMISSION</p>
<p style="text-align:center;">Washington, D.C. 20549</p>
<p style="font-weight:bold;text-align:center;">FORM 10-K</p>
<!-- ... (Full body of the primary filing) ... -->
</body>
</html>
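Because this endpoint does not return JSON, client code has to branch on the content-type header. A minimal Python sketch of that branching follows; the helper names (`content_url`, `fetch_content`, `decode_body`) are hypothetical, and falling back to UTF-8 when the server sends no charset is an assumption rather than documented behavior.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://data.businessquant.com/secfilings"


def content_url(accession: str, filename: str, api_key: str) -> str:
    """Build the Content endpoint URL for one file from the manifest."""
    query = urllib.parse.urlencode({"filename": filename, "api_key": api_key})
    return f"{BASE_URL}/{accession}/content?{query}"


def fetch_content(accession: str, filename: str, api_key: str, timeout: float = 30.0):
    """Return (media_type, charset, raw_bytes); performs a network request."""
    url = content_url(accession, filename, api_key)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        media_type = resp.headers.get_content_type()
        # Assumption: treat a missing charset as UTF-8.
        charset = resp.headers.get_content_charset() or "utf-8"
        return media_type, charset, resp.read()


def decode_body(media_type: str, body: bytes, charset: str = "utf-8"):
    """Decode HTML/XML responses to str; leave images and other types as bytes."""
    if media_type.startswith("text/") or media_type.endswith("xml"):
        return body.decode(charset, errors="replace")
    return body
```

Keeping the decode step separate from the download means the same `fetch_content` call works for reports, linkbases, and graphics, and only text-like payloads are ever converted to strings.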
3. Supported Document Classes
The Document Splitting API systematically breaks down the entire SGML container. When scanning the document manifest, you will encounter the following broad categories of extracted files. Understanding these classes is crucial for building robust NLP and quantitative data pipelines.
The Primary HTML Filing & Base Media
The most valuable asset in the container is the primary HTML document. SEC EDGAR files often embed XBRL elements (<ix:nonFraction>) directly into the text, which breaks standard web scrapers. Our engine proactively strips out these inline XBRL tags, namespaces, and SGML headers to return the pure, clean HTML of the actual report.
| Document Class / Type | Typical Extension | Standard Description |
|---|---|---|
| Primary HTML Filing | .htm / .html | The main body of the SEC filing (e.g., the actual 10-K, 10-Q, or 8-K text). Our engine has cleaned the raw SGML and stripped out inline XBRL data to return pure HTML. This is typically assigned Sequence 1 in the container. |
| GRAPHIC | .jpg / .gif | Embedded images, charts, and logos. The engine natively decodes the base64/uuencoded streams in the SGML and returns raw binary image files. |
| XML | .xml | General XML data files. This includes structured data payloads like the raw matrices found in Form 4 Insider Transactions or Form 13F Institutional Holdings. |
| CORRESP | .htm / .txt | Correspondence. Uploads of correspondence between the SEC and the company regarding filing reviews or administrative matters. |
| INFORMATION TABLE | .xml | Specific to 13F filings, this contains the actual XML table of the institution's equity holdings. |
XBRL / Machine-Readable Data Linkbases
If you are building custom financial data models, you can bypass the HTML entirely and fetch the raw XML linkbases. These files dictate the exact mathematical and structural relationships of every financial data point the company reported.
| Document Class / Type | Typical Extension | Standard Description |
|---|---|---|
| EX-101.INS | .xml | XBRL Instance Document. The core machine-readable file containing the raw financial data tags, values, and contextual dates. |
| EX-101.SCH | .xsd | XBRL Taxonomy Extension Schema. The company-specific taxonomy schema validating the data against the standard US-GAAP taxonomy. |
| EX-101.CAL | .xml | XBRL Calculation Linkbase. Defines the mathematical relationships between XBRL tags (e.g., ensuring that Assets = Liabilities + Equity). |
| EX-101.DEF | .xml | XBRL Definition Linkbase. Defines dimensional structures and table architectures in the data (e.g., breaking down revenue by geographical segments). |
| EX-101.LAB | .xml | XBRL Label Linkbase. Provides the human-readable English labels for the underlying XBRL tags so they can be rendered correctly in financial statements. |
| EX-101.PRE | .xml | XBRL Presentation Linkbase. Dictates how the XBRL tags should be ordered, indented, and nested when rendered visually on a balance sheet. |
Common Legal Exhibits (Regulation S-K)
Exhibits are governed by Item 601 of Regulation S-K and serve as high-value alternative data targets. For example, quantitative researchers frequently mine EX-10 for supply chain agreements that are not discussed in the main report, or EX-21 to map a conglomerate's offshore subsidiaries.
| Document Class / Type | Typical Extension | Standard Description |
|---|---|---|
| EX-1 | .htm / .txt | Underwriting Agreements related to the public offering of securities. |
| EX-2 | .htm / .txt | Plans of acquisition, reorganization, arrangement, liquidation, or succession. |
| EX-3 | .htm / .txt | Articles of incorporation and corporate bylaws. |
| EX-4 | .htm / .txt | Instruments defining the rights of security holders, including indentures. |
| EX-5 | .htm / .txt | Legal opinion regarding the legality and validity of the securities being registered. |
| EX-10 | .htm / .txt | Material Contracts. Highly valuable for research, including major supply agreements, real estate leases, executive compensation contracts, and credit facilities. |
| EX-14 | .htm / .txt | The company’s official code of ethics or conduct. |
| EX-16 | .htm / .txt | Letter from a former accountant regarding their termination or resignation. |
| EX-21 | .htm / .txt | Subsidiaries of the Registrant. A complete hierarchical list of the company’s domestic and foreign subsidiaries. |
| EX-23 | .htm / .txt | Consents from auditors or legal counsel to use their opinions in the filing. |
| EX-31 / EX-31.1 | .htm / .txt | Sarbanes-Oxley Act Section 302 certifications by the CEO/CFO regarding the accuracy of the report. |
| EX-32 / EX-32.1 | .htm / .txt | Sarbanes-Oxley Act Section 906 certifications by the CEO/CFO. |
| EX-99 / EX-99.1 | .htm / .txt | Additional Exhibits. A catch-all that frequently contains heavily formatted Press Releases or Investor Presentations attached to 8-K filings. |
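When routing manifest entries into a pipeline, the three groups of classes above can be told apart from the `type` field alone. The sketch below is one possible heuristic, not the API's own classification: the function names and the returned labels are hypothetical, and treating sequence 1 as the primary document relies on the "typically" noted in the first table.

```python
def base_exhibit(doc_type: str) -> str:
    """Collapse numbered sub-exhibits, e.g. "EX-31.1" -> "EX-31"."""
    return doc_type.split(".", 1)[0]


def classify(doc: dict) -> str:
    """Assign a manifest entry to a coarse class based on its "type" field."""
    doc_type = doc.get("type", "").upper()
    if doc_type.startswith("EX-101"):
        return "xbrl"        # EX-101.INS, EX-101.SCH, EX-101.CAL, ...
    if doc_type.startswith("EX-"):
        return "exhibit"     # Regulation S-K exhibits such as EX-10 or EX-21
    if doc_type == "GRAPHIC":
        return "graphic"
    if doc.get("sequence") == 1:
        return "primary"     # heuristic: the main filing is typically first
    return "other"           # e.g. XML, CORRESP, INFORMATION TABLE


def group_by_class(docs: list) -> dict:
    """Group manifest entries into {class_label: [entries, ...]}."""
    groups = {}
    for doc in docs:
        groups.setdefault(classify(doc), []).append(doc)
    return groups
```

The XBRL check runs before the generic exhibit check because `EX-101.*` types would otherwise be swallowed by the `EX-` prefix test.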
4. Frequently Asked Questions & Edge Cases
Legacy Monolithic Filings
Filings submitted before the widespread adoption of HTML standardization on EDGAR (typically pre-2003) were often submitted as a single, monolithic .txt file. For these legacy filings, the Document Splitting manifest may return only a single document (Sequence 1) containing the entire text, because the original filer did not formally attach discrete exhibits or XML files.
Amended Filings (10-K/A, 8-K/A)
When a company amends a filing, the amendment is submitted under a new Accession Number. The manifest for an amended filing contains only the documents that were updated or added; it does not duplicate the exhibits from the original filing. If you need the complete set of exhibits, you must fetch the manifests for both the original Accession Number and the amended Accession Number.
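Combining the two manifests is left to the caller. A minimal Python sketch is shown here, assuming both manifests have already been fetched and follow the response shape from section 1. The function names and the `source_accession` field are hypothetical, and letting the amendment win per document type is a design choice that only suits types appearing once per filing (it would drop all but one GRAPHIC entry, for instance).

```python
def combine_manifests(original: dict, amendment: dict) -> list:
    """Merge two manifests into one list, tagging each entry with its filing."""
    combined = []
    for manifest in (original, amendment):
        accession = manifest["metadata"]["accession"]
        for doc in manifest.get("data", []):
            # Copy the entry so the input manifests are left unmodified.
            combined.append({**doc, "source_accession": accession})
    return combined


def latest_by_type(combined: list) -> dict:
    """Map each document type to its last entry, so amended documents win.

    Assumption: each type occurs at most once per filing; otherwise earlier
    entries of the same type are overwritten.
    """
    latest = {}
    for doc in combined:
        latest[doc["type"]] = doc
    return latest
```

Tagging every entry with its source accession keeps the provenance visible, so downstream code can still request each file from the correct filing's Content endpoint.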
Handling Media and Graphics
Graphics are handled differently from text documents. If you pass a filename for a GRAPHIC type (such as logo.jpg) to the Content API endpoint, the API returns the raw binary image data with the appropriate image/jpeg or image/gif content-type header. Ensure your HTTP client is configured to handle binary responses when querying these files.
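On the client side, this mostly means writing the response bytes to disk without decoding them as text. The following Python sketch illustrates this under stated assumptions: `sniff_image_type` and `save_graphic` are hypothetical helpers, and the signature check (JPEG data begins with the bytes FF D8 FF, GIF data with `GIF87a` or `GIF89a`) is an optional sanity test added here, not something the API requires.

```python
from pathlib import Path


def sniff_image_type(data: bytes):
    """Guess the image type from its leading bytes; None if unrecognized."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    return None


def save_graphic(data: bytes, destination: str) -> Path:
    """Write raw image bytes to disk unchanged, refusing non-image payloads."""
    if sniff_image_type(data) is None:
        raise ValueError("payload does not look like a JPEG or GIF image")
    path = Path(destination)
    path.write_bytes(data)  # binary write: no text decoding or newline translation
    return path
```

Checking the signature before saving catches the case where an error message or HTML page was returned instead of the expected image.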