Elasticsearch generator

WARNING: this generator has been tested on Elasticsearch v1.7.5. Subsequent versions may require different mappings and queries.

General description

The Elasticsearch generator uses Elasticsearch to keep track of the state of ResourceSync resources. Data about the resources and their changes must be recorded in an index that is kept continuously up to date. This index contains two document types:

  • resource: keeps track of the current state of the resources to be synchronized
  • change: logs the changes to the resources to be synchronized

Every time a resource R changes (e.g. a new resource is added, so change = ‘created’), both document types must be updated:

  • resource document:
    • If change == ‘created’: create new document with R’s metadata
    • If change == ‘updated’: update R’s metadata
    • If change == ‘deleted’: delete R’s document in ES
  • change document: create a new document logging the change that occurred

In this way, the resource type always contains a snapshot of the current state of the resources, from which a resource list can easily be generated. Likewise, change lists can be created and updated by querying Elasticsearch with a time interval that selects the changes of interest. The ResourceSync source simply uses the Elasticsearch index as its reference for the resources' state.
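
The update workflow above can be sketched in Python as follows. This is only an illustration, not part of the generator: the function name build_operations, the doc_id argument, and the choice to copy resource_set, location, and lastmod into the change document are assumptions made for the example. The paths follow the standard Elasticsearch document REST endpoints, and the function only builds the requests without sending them.

```python
def build_operations(index, resource_type, change_type, doc_id, change, metadata):
    """Return the (HTTP method, path, body) requests for one change to resource R.

    doc_id is a hypothetical identifier for R's resource document; how
    identifiers are chosen is not specified by this README.
    """
    if change not in ("created", "updated", "deleted"):
        raise ValueError("unknown change: %r" % (change,))

    resource_path = "/%s/%s/%s" % (index, resource_type, doc_id)
    if change == "deleted":
        # The resource type is a snapshot, so a deleted resource is removed from it.
        resource_op = ("DELETE", resource_path, None)
    else:
        # PUT with an explicit id creates the document or replaces it in full.
        resource_op = ("PUT", resource_path, metadata)

    # The change type is a log: every change adds a new document.
    change_doc = {
        "resource_set": metadata["resource_set"],
        "location": metadata["location"],
        "lastmod": metadata["lastmod"],
        "change": change,
    }
    change_op = ("POST", "/%s/%s" % (index, change_type), change_doc)
    return [resource_op, change_op]
```

Each returned tuple can then be sent with any HTTP client to elastic_host:elastic_port.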

Why is this useful?

ResourceSync is a flexible and powerful tool to synchronize very large sets of resources, which may or may not be physical files. Elasticsearch, or another data storage system, assisted by an update layer on top of it, makes it possible to keep track of the state of the resources without time-consuming processes (e.g. checking for changes across several million files). Moreover, this enables more sophisticated pagination techniques that avoid regenerating a whole resourcelist when only a few changes have occurred (e.g. it may be sufficient to regenerate a single sitemap instead of 50k).

Generator parameters

resource_set: the set the resources belong to; each resource can belong to different sets (e.g. 'foo')

resource_root_dir: root directory of the resources (e.g. /home/foo); see location in the mapping section

elastic_host: ES host (e.g. localhost)

elastic_port: ES port (e.g. 9200)

elastic_index: the ES index name

elastic_resource_doc_type: the ES type for resource documents

elastic_change_doc_type: the ES type for change documents

strategy: same as the executor parameter

url_prefix: same as the executor parameter

max_items_in_list: same as the executor parameter

NOTES:

  • the strategy parameter lets the generator produce both resourcelists and changelists
  • the resource_root_dir and url_prefix parameters are needed by the generator to resolve the different forms of the location object
  • the max_items_in_list parameter is needed by the generator in order to overcome the scan-and-scroll limit of 10K hits
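
As a rough illustration, the parameters above could be collected in a Python dictionary and checked before the generator is started. The parameter names come from the list above; every value shown is a placeholder, and the missing_parameters helper is hypothetical. In particular, the accepted values of strategy are defined by the executor and are not listed here.

```python
REQUIRED_PARAMETERS = (
    "resource_set", "resource_root_dir",
    "elastic_host", "elastic_port", "elastic_index",
    "elastic_resource_doc_type", "elastic_change_doc_type",
    "strategy", "url_prefix", "max_items_in_list",
)

# Placeholder values for illustration only.
params = {
    "resource_set": "foo",
    "resource_root_dir": "/home/foo",
    "elastic_host": "localhost",
    "elastic_port": 9200,
    "elastic_index": "resourcesync",          # hypothetical index name
    "elastic_resource_doc_type": "resource",
    "elastic_change_doc_type": "change",
    "strategy": "resourcelist",               # placeholder; see the executor docs
    "url_prefix": "http://example.com/rs",    # hypothetical URL prefix
    "max_items_in_list": 10000,               # placeholder list size
}


def missing_parameters(candidate):
    """Return the names of the generator parameters absent from candidate."""
    return [name for name in REQUIRED_PARAMETERS if name not in candidate]
```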

Elasticsearch mappings

resource type

Here's the mapping for the resource document type:

{
  "resource": {
    "properties": {
      "resource_set": {
        "type": "string",
        "index": "not_analyzed"
      },
      "location": {
        "type": "nested",
        "properties": {
          "value":{
            "type":"string",
            "index":"not_analyzed"
          },
          "type":{
            "type":"string",
            "index":"not_analyzed"
          }
        }
      },
      "length": {
        "type": "integer",
        "index": "not_analyzed"
      },
      "md5": {
        "type": "string",
        "index": "not_analyzed"
      },
      "mime": {
        "type": "string",
        "index": "not_analyzed"
      },
      "lastmod": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      },
      "ln": {
        "type": "nested",
        "index_name": "link",
        "properties": {
          "href": {
            "type": "nested",
            "properties": {
              "value":{
                "type":"string",
                "index":"not_analyzed"
              },
              "type":{
                "type":"string",
                "index":"not_analyzed"
              }
            }
          },
          "rel": {
            "type": "string",
            "index": "not_analyzed"
          },
          "mime": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      },
      "timestamp": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      }
    }
  }
}

For each resource, the following fields will be filled out:

  • resource_set: the name of the resource set the resource will belong to
  • timestamp: timestamp automatically generated by Elasticsearch when the document is created/updated
  • location: can be a
    • url: complete resource address, the url_prefix parameter won't be used
    • abs_path: absolute path, which will be resolved with respect to the resource_root_dir parameter and then attached to the url_prefix
    • rel_path: relative path, which will be attached to the url_prefix
  • length: length of the resource
  • md5: md5 hash of the resource (may become an array of hashes to support different hashing techniques)
  • mime: mime type of the resource
  • lastmod: last modification time of the resource
  • ln: links to other resources, each one of them can have three fields
    • rel: description of the relationship (e.g. describes, described by)
    • href: link to the resource, similar to location (url/abs_path/rel_path)
    • mime: mime type of the linked resource

change type

Here's the mapping for the change document type:

{
  "change": {
    "properties": {
      "resource_set": {
        "type": "string",
        "index": "not_analyzed"
      },
      "location": {
        "type": "nested",
        "properties": {
          "value":{
            "type":"string",
            "index":"not_analyzed"
          },
          "type":{
            "type":"string",
            "index":"not_analyzed"
          }
        }
      },
      "change": {
        "type": "string",
        "index": "not_analyzed"
      },
      "lastmod": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      },
      "timestamp": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      }
    }
  }
}

For each change, the following fields will be filled out:

  • resource_set: the name of the resource set the resource will belong to
  • timestamp: timestamp automatically generated by Elasticsearch when the document is created/updated
  • location: can be a
    • url: complete resource address, the url_prefix parameter won't be used
    • abs_path: absolute path, which will be resolved with respect to the resource_root_dir parameter and then attached to the url_prefix
    • rel_path: relative path, which will be attached to the url_prefix
  • change: type of change that occurred, one of created/updated/deleted
  • lastmod: last modification time of the resource
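
To build a change list, the change documents of one resource set can be selected by time interval. The sketch below only constructs a search body in the Elasticsearch 1.x query DSL (a filtered query with term and range filters); it is an assumption about how such a query might look, not the query the generator actually sends. The function name changes_query and the choice of a half-open interval are illustrative. Dates are formatted to match the mapping's yyyy-MM-dd'T'HH:mm:ssZ pattern.

```python
from datetime import datetime, timezone

# strftime equivalent of the mapping's date format (offset without a colon).
ES_DATE_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def changes_query(resource_set, start, end):
    """Search body selecting the changes of resource_set with start <= timestamp < end."""
    for moment in (start, end):
        if moment.tzinfo is None:
            raise ValueError("start and end must be timezone-aware datetimes")
    return {
        "query": {
            "filtered": {
                "filter": {
                    "bool": {
                        "must": [
                            {"term": {"resource_set": resource_set}},
                            {"range": {"timestamp": {
                                "gte": start.strftime(ES_DATE_FORMAT),
                                "lt": end.strftime(ES_DATE_FORMAT),
                            }}},
                        ]
                    }
                }
            }
        },
        # Oldest change first, so the log can be replayed in order.
        "sort": [{"timestamp": {"order": "asc"}}],
    }


body = changes_query(
    "foo",
    datetime(2016, 1, 1, tzinfo=timezone.utc),
    datetime(2016, 2, 1, tzinfo=timezone.utc),
)
```

The resulting body would be posted to the _search endpoint of the change type. Later Elasticsearch versions replaced the filtered query, so the body would need adapting there.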

Note: the current mapping will be extended with further metadata and updated according to new versions of the ResourceSync specification.