The platform provides tools and APIs to import content:
- From tens of documents to billions of them, at import rates of up to 10,000 documents per second
- With or without metadata
- Handling security, lifecycle and other system properties if necessary.
These tools natively handle several formats for specifying document properties:
- XML (one XML file per document, or a single XML file for all documents)
- CSV (one line per document)
- Properties file (one properties file per imported file, or per folder, for instance)
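As an illustration of the CSV form, the snippet below parses a one-line-per-document payload into property dictionaries. The column names (`dc:title`, `dc:description`) and sample values are assumptions for the example, not a prescribed schema.

```python
import csv
import io

# Hypothetical CSV payload: one line per document, columns map to
# document properties (the property names are assumptions).
CSV_CONTENT = """\
name,type,dc:title,dc:description
invoice-001,File,Invoice 001,First quarter invoice
contract-042,File,Contract 042,Supplier agreement
"""

def parse_csv(content):
    """Turn each CSV line into a dict of document properties."""
    reader = csv.DictReader(io.StringIO(content))
    return [dict(row) for row in reader]

docs = parse_csv(CSV_CONTENT)
print(docs[0]["dc:title"])
```

Each resulting dict can then be handed to whichever document-creation strategy you choose.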
Importing content in the platform means:
- Creating a document that will store the metadata and reference the binaries
- Importing the binaries (when there are some; some projects do not handle files, just business objects)
There can be several strategies to create documents:
- Use the REST API: The simplest strategy. It can be used remotely as long as there is HTTP access. It is the least performant, although proven rates of thousands of documents per second can be reached.
- Use the Java API server-side: Transactional, multi-threaded and highly performant. It provides the ability to disable event processing or to bundle it.
- Fill in the database directly (SQL scripts, MongoDB collections, ...)
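To make the REST strategy concrete, here is a minimal sketch of the JSON body a client would send to create a document through the REST API. The document name, type and properties are illustrative; the endpoint, authentication and error handling are deliberately left out, and the exact payload shape should be checked against the REST API documentation for your version.

```python
import json

def make_create_payload(name, doc_type, properties):
    """Build a document-creation request body: entity type,
    document type, name and a dict of properties."""
    return {
        "entity-type": "document",
        "name": name,
        "type": doc_type,
        "properties": properties,
    }

payload = make_create_payload(
    "invoice-001", "File", {"dc:title": "Invoice 001"}
)
print(json.dumps(payload, indent=2))
```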
There are several strategies to upload the binary content (the files):
- Using the REST API: The REST API provides the batch endpoint to upload content, with the ability to upload binaries in chunks and thus implement resumable upload patterns. Network bandwidth and latency must be taken into account when using this strategy.
- Uploading them on a file system accessible from the Nuxeo server: No network limitation as files may then be just "moved" to the right place.
- Moving the file directly to the place where Nuxeo will store it: Whether it is a file system binary store, an S3 binary store or an Azure object store, it is always possible to drop the files in the right place so as to minimize operations. This is the most efficient approach, especially when the import involves very large, multi-terabyte files.
The Node.js importer uses the REST API and provides additional services compared to the bare approach:
- Client-side browsing of a complete hierarchy of content (folders, subfolders and files)
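That client-side hierarchy browsing amounts to a recursive crawl that pairs each file with its relative folder, sketched here in Python rather than Node.js:

```python
import os
import tempfile

def browse_hierarchy(root):
    """Yield (relative_folder, filename) pairs for every file under
    root, the way a client-side importer crawls folders and subfolders."""
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        for filename in sorted(filenames):
            yield rel, filename

# Build a tiny throwaway tree to crawl.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "folder", "subfolder"))
open(os.path.join(root, "folder", "a.txt"), "w").close()
open(os.path.join(root, "folder", "subfolder", "b.txt"), "w").close()

entries = list(browse_hierarchy(root))
print(entries)
```

Each pair can then drive a document-creation call that recreates the same folder structure in the repository.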
The Nuxeo Bulk Document Importer is an importer framework, provided as an addon, that can be used to build custom importers. It relies on a standard crawler/transformer/writer schema. The Scan Importer and CSV Importer addons use this framework (see next sections). It is the de facto choice when you want to reach hyperscale import numbers (up to tens of thousands of documents per second). All you need to do is write your own Document Factory, which is in charge of the document creation logic in the repository. You can then easily launch the import, controlling how many documents go into a batch, how many batches per transaction, etc.
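The batch and transaction controls described above can be sketched as follows; `create` stands in for the actual Document Factory call, and transaction commits are merely counted rather than performed.

```python
def batches(items, batch_size):
    """Group documents into batches of batch_size."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def run_import(docs, docs_per_batch, batches_per_transaction, create):
    """Create documents batch by batch, committing the transaction
    every batches_per_transaction batches (commits are just counted)."""
    commits = 0
    n = 0
    for n, batch in enumerate(batches(docs, docs_per_batch), start=1):
        for doc in batch:
            create(doc)
        if n % batches_per_transaction == 0:
            commits += 1
    if n % batches_per_transaction != 0:
        commits += 1  # commit the trailing partial transaction
    return commits

created = []
commits = run_import(list(range(10)), 2, 2, created.append)
print(len(created), commits)
```

Tuning these two knobs lets you trade transaction overhead against rollback granularity when something fails mid-import.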
The Nuxeo Platform Scan Importer is a submodule of the importer framework and is typically used for the output of a digitization chain. It watches a given folder and imports all content referenced via XML files, together with its metadata. The Scan Importer also offers very advanced XML-to-document mapping capabilities, including the ability to run automation processing during the import phase.
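A toy version of that XML-to-document mapping might look like this; the descriptor format shown is an assumption for illustration, not the Scan Importer's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML descriptor: element and attribute names are
# assumptions made for this example.
XML_DESCRIPTOR = """\
<document type="File" name="scan-001">
  <property name="dc:title">Scanned invoice</property>
  <property name="dc:source">scanner-3</property>
</document>
"""

def map_xml_to_properties(xml_text):
    """Extract the document type, name and property dict from an
    XML descriptor file."""
    root = ET.fromstring(xml_text)
    props = {p.get("name"): p.text for p in root.findall("property")}
    return root.get("type"), root.get("name"), props

doc_type, name, props = map_xml_to_properties(XML_DESCRIPTOR)
print(doc_type, name, props["dc:title"])
```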
Nuxeo CSV makes use of the importer framework and provides a UI to upload a CSV file whose content will be used to map columns values to properties of created documents.
You can use the REST API directly and implement the importing logic you need from there. Alternatively, you can use the CoreSession object in a custom Java component deployed server-side and implement the importing logic there. We also provide a default import/export format for the repository, with piping logic.
Nuxeo CSV, which maps CSV column values to properties of the created documents, is the method used in Aktua.