Indexing system is unsound #223

@Zaharid

We currently index/upload:

  • Validphys reports
  • fits
  • theories

and, implicitly, every file contained in a validphys report.

The way it is done has many issues.

  • It is insecure because it uses an ssh connection with more privileges than are required to upload a file: enough, for example, to manipulate every file ever uploaded. People rarely bother with ssh-agent, so the ssh keys are typically stored unencrypted.
  • It is inefficient because the only functionality we have corresponds to reindex_all, which scans over all the files in order to build the index, and it is triggered every time a single file is updated. This is not negligible when we reconstruct metadata in complicated ways, such as by parsing a large html file (as done for validphys reports without meta.yaml).
  • The common parts of indexing fits, theories and reports could maybe be bundled together.
  • It is un(der)documented and basically only I understand how it works.
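As a rough illustration of the bundling point above, the per-kind differences could be confined to one metadata extractor each, with everything else shared. This is only a sketch under assumptions: the names `EXTRACTORS`, `make_entry`, `report_meta`, `name_only_meta` and `parse_flat_meta` are hypothetical and not taken from the existing serverscripts, and the flat `key: value` parser is a stand-in for a real YAML loader.

```python
from pathlib import Path
from typing import Callable, Dict


def parse_flat_meta(path: Path) -> Dict[str, str]:
    """Parse flat 'key: value' lines (a stand-in for a real YAML loader)."""
    meta = {}
    for line in path.read_text().splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            meta[key.strip()] = value.strip()
    return meta


def report_meta(item: Path) -> Dict[str, str]:
    """Metadata for a report folder, read from meta.yaml when present."""
    meta_file = item / "meta.yaml"
    if meta_file.is_file():
        return parse_flat_meta(meta_file)
    # The expensive fallback (e.g. parsing the html) would go here.
    return {}


def name_only_meta(item: Path) -> Dict[str, str]:
    """Assumption for this sketch: fits and theories need no extra metadata."""
    return {}


# Hypothetical registry: the only kind-specific part of the indexing.
EXTRACTORS: Dict[str, Callable[[Path], Dict[str, str]]] = {
    "reports": report_meta,
    "fits": name_only_meta,
    "theories": name_only_meta,
}


def make_entry(kind: str, item: Path) -> Dict[str, str]:
    """Build one index entry; shared by all kinds of indexed items."""
    return {"kind": kind, "name": item.name, **EXTRACTORS[kind](item)}
```

With this layout, supporting a new kind of item would amount to adding one function to the registry.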

I think ideally the indexing should have the following properties:

  • Single source of truth based on the content of the files we index. There should not be a database with potentially conflicting information.

  • The basic actions are index_one and reindex_all, where index_one adds one item to the index that the user sees, and reindex_all rebuilds the index from scratch.

  • The index_one action can be performed without full privileges, which limits the damage if the credentials used for uploading are lost or leaked.
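A minimal sketch of the two actions, assuming each indexed item is a folder under a common root and the user-facing index is a single JSON file. The layout, the `describe` helper and the JSON format are all assumptions made for illustration; only the names index_one and reindex_all come from the description above. Every entry is derived from the item's own files, so there is no second store that could disagree with them.

```python
import json
import os
import tempfile
from pathlib import Path


def describe(item: Path) -> dict:
    """Derive an entry purely from the item's content (here: its file list)."""
    files = sorted(
        p.relative_to(item).as_posix() for p in item.rglob("*") if p.is_file()
    )
    return {"files": files}


def _write_index(index_file: Path, index: dict) -> None:
    """Write to a temporary file, then rename, so readers never see a partial index."""
    fd, tmp = tempfile.mkstemp(dir=index_file.parent)
    with os.fdopen(fd, "w") as handle:
        json.dump(index, handle, indent=2, sort_keys=True)
    os.replace(tmp, index_file)


def reindex_all(root: Path, index_file: Path) -> dict:
    """Rebuild the whole index from scratch by scanning every item."""
    index = {p.name: describe(p) for p in sorted(root.iterdir()) if p.is_dir()}
    _write_index(index_file, index)
    return index


def index_one(root: Path, index_file: Path, name: str) -> dict:
    """Add or refresh a single item without rescanning the others."""
    index = json.loads(index_file.read_text()) if index_file.is_file() else {}
    index[name] = describe(root / name)
    _write_index(index_file, index)
    return index
```

Because both actions go through the same `describe` function, running reindex_all after any sequence of index_one calls should give the same index, which makes the full rebuild a safe recovery path rather than the everyday one.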

These can more or less be done with the current serverscripts layout, which is good in that it is dead simple, but currently bad in all the ways listed above.
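One possible way to get a reduced-privilege index_one, sketched under assumptions: OpenSSH can bind a key to a single forced command through the `command="..."` option in `authorized_keys`, and it then passes whatever the client asked for in the `SSH_ORIGINAL_COMMAND` environment variable. A small server-side wrapper could accept only requests of the form `index_one <name>`. The request format, the name pattern and the function names below are hypothetical, and the file upload itself would still need its own restricted channel.

```python
import os
import re
import sys

# Hypothetical rule: item names are plain identifiers, never paths.
SAFE_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def parse_request(command: str) -> str:
    """Return the item name from 'index_one <name>', or raise ValueError."""
    parts = command.split()
    if len(parts) != 2 or parts[0] != "index_one":
        raise ValueError("only 'index_one <name>' is permitted")
    name = parts[1]
    if not SAFE_NAME.fullmatch(name) or ".." in name:
        raise ValueError(f"unsafe item name: {name!r}")
    return name


def main() -> int:
    """Entry point meant to run as the forced command for the upload key."""
    try:
        name = parse_request(os.environ.get("SSH_ORIGINAL_COMMAND", ""))
    except ValueError as err:
        print(err, file=sys.stderr)
        return 1
    # Placeholder: a real wrapper would call the actual index_one action here.
    print(f"would index {name}")
    return 0
```

A key restricted this way could at worst trigger the indexing of an existing item, rather than manipulate every file ever uploaded.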

Labels: devtools (Build, automation and workflow), server
