DeepScan for automated NAS-UNIL file screening

DeepScan is an advanced tool developed by the DSBU to analyze and screen projects stored on the NAS. It scans directories within NAS projects and identifies:

  • Folders or data that are older than 10 years.
  • Personal data, based on folder names (e.g., “CV,” “recommendation letters,” “email,” “birth”). The tool uses a list of 250 predefined keywords for screening.
  • The file formats found in the scanned directories.
  • Folder sizes.

This tool generates detailed reports, which can be sent to researchers upon request. The insights provided help researchers manage their data efficiently and ensure compliance with storage policies.
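The four checks listed above can be illustrated with a minimal sketch. It is not DeepScan's actual implementation: the function name scan_folder, the four sample keywords (a small stand-in for the real 250-keyword list), and the choice to measure age from the most recent file modification time are all assumptions made for illustration.

```python
import os
import re
import time
from collections import Counter

# Hypothetical subset for illustration; DeepScan's real 250-keyword list is not shown here.
KEYWORDS = {"cv", "recommendation", "email", "birth"}
TEN_YEARS_SECONDS = 10 * 365.25 * 24 * 3600


def scan_folder(path, now=None):
    """Summarise one folder: age flag, keyword hits, file formats, size."""
    now = time.time() if now is None else now
    formats = Counter()
    total_size = 0
    newest_mtime = 0.0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                info = os.stat(os.path.join(root, name))
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
            total_size += info.st_size
            newest_mtime = max(newest_mtime, info.st_mtime)
            formats[os.path.splitext(name)[1].lower() or "(none)"] += 1
    # Match whole words of the folder name, so "cv_2012" hits "cv" but "mcv" does not.
    folder_name = os.path.basename(os.path.normpath(path)).lower()
    words = set(re.split(r"[^a-z0-9]+", folder_name))
    return {
        "path": path,
        # Assumption: a folder is "old" only if its most recently modified file is old.
        "older_than_10_years": newest_mtime > 0 and now - newest_mtime > TEN_YEARS_SECONDS,
        "keyword_hits": sorted(KEYWORDS & words),
        "formats": dict(formats),
        "size_bytes": total_size,
    }
```

Calling scan_folder on a project directory returns one dictionary per folder, which could then be collected into a report. Matching on folder names alone will miss personal data stored under neutral names, so a hit list of this kind is a starting point for review rather than a guarantee.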

Key Features of DeepScan

  • Identify Obsolete Data: Highlights folders that are more than 10 years old, enabling researchers to clean up outdated or irrelevant data.
  • Personal Data Screening: Scans directories for personal data using a robust keyword list to ensure compliance with data privacy regulations.
  • File Format Analysis: Reports on the types of file formats found in the project, helping researchers document their stored data.
  • More Comprehensive README Files:
    DeepScan helps generate retrospective README files. Because it analyzes all file formats within the researcher’s directories, it can identify and classify the types of data present, providing the information needed to write accurate and comprehensive README files.

Characteristics of a README File

A README file is a crucial component of dataset documentation, providing essential information to ensure that the data can be accurately interpreted and utilized by others, as well as by yourself in the future. It serves to enhance the usability, reproducibility, and transparency of your dataset.
Key Elements to Include in a Data README File:

General Information:

  • Dataset Title: Provide a clear and descriptive title for your dataset.
  • Author Information: List the names, affiliations, and contact details of the principal investigator and any co-investigators.
  • Date of Data Collection: Specify the dates when the data was collected.
  • Geographical Location: If applicable, mention where the data was collected.
  • Keywords: Include relevant keywords to describe the data’s subject matter.

Data and File Overview:

  • File Descriptions: Provide a brief description of each file, including its format and purpose.
  • File Structure: Explain the organization of the files and any relationships between them.
  • File Naming Conventions: Describe the naming conventions used for files and directories.

Methodological Information:

  • Data Collection Methods: Detail the procedures and instruments used to collect the data.
  • Data Processing: Explain any processing or transformation applied to the data.
  • Quality Assurance: Describe steps taken to ensure data quality and integrity.

Data-Specific Information:

  • Variable Definitions: Define all variables, including units of measurement and possible codes.
  • Missing Data: Specify how missing data is represented in the dataset.
  • Data Formats: Indicate any specialized formats or abbreviations used.

Sharing and Access Information:

  • Licenses or Restrictions: State any licenses or restrictions associated with the data.
  • Related Publications: Provide references to publications that use or are related to the data.
  • Citation Information: Offer a recommended citation for the dataset.

Best Practices:

  • File Format: Write the README as a plain text file (e.g., README.txt) to ensure accessibility and longevity.
  • Standardized Formatting: Use consistent formatting and terminology throughout the README.
  • Clarity and Detail: Provide sufficient detail to allow others to understand and use the data without additional assistance.

For comprehensive guidance and templates, consult the DSBU website.
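
As an illustration of how the elements above could be turned into a starting point, the following sketch builds a plain-text README skeleton and pre-fills a file format overview. It is not the DSBU template: the function name readme_skeleton, the layout, and the formats argument (a mapping from file extension to file count, such as a scan might produce) are assumptions.

```python
# Section headings and fields follow the list of key elements above.
SECTIONS = {
    "GENERAL INFORMATION": [
        "Dataset title", "Author information", "Date of data collection",
        "Geographical location", "Keywords",
    ],
    "DATA AND FILE OVERVIEW": [
        "File descriptions", "File structure", "File naming conventions",
    ],
    "METHODOLOGICAL INFORMATION": [
        "Data collection methods", "Data processing", "Quality assurance",
    ],
    "DATA-SPECIFIC INFORMATION": [
        "Variable definitions", "Missing data", "Data formats",
    ],
    "SHARING AND ACCESS INFORMATION": [
        "Licenses or restrictions", "Related publications", "Citation information",
    ],
}


def readme_skeleton(title, formats=None):
    """Return README text with empty fields for the researcher to complete."""
    lines = []
    for section, fields in SECTIONS.items():
        lines.append(section)
        for field in fields:
            value = title if field == "Dataset title" else ""
            lines.append(f"  {field}: {value}".rstrip())
        # Hypothetical extra: list the file formats found by a scan, if provided.
        if section == "DATA AND FILE OVERVIEW" and formats:
            lines.append("  File formats found:")
            for ext, count in sorted(formats.items()):
                lines.append(f"    {ext}: {count} file(s)")
        lines.append("")
    return "\n".join(lines)
```

The returned text can be saved as README.txt and completed by hand. Only the title and the format inventory are filled in automatically; every other field still needs input from someone who knows the data.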

How DeepScan Can Help Researchers

  • Data Cleaning and Curation: Researchers can use the DeepScan report to remove obsolete data and ensure that personal data is not stored on the NAS.
  • Support for Archiving: For Long-Term Storage (LTS), DeepScan helps identify research data suitable for preservation. It screens directories to confirm they contain only research data (no personal data) and files that are less than 10 years old. See the LTS procedure (link) for details.
  • Generate More Comprehensive README Files:
    DeepScan helps generate retrospective README files. Because it analyzes all file formats within the directories of the researcher’s NAS project, it can identify and classify the types of data present, providing the information needed to write accurate and comprehensive README files even in the absence of the original researcher.

DeepScan for D2C

DeepScan can also scan directories within the D2C section of the NAS-UNIL. The DSBU can screen researchers’ directories to check whether they meet the following criteria:

  • File age (less than 10 years).
  • No personal data.
  • Proper file formats for archiving or cleaning.

By screening the D2C section, researchers can efficiently organize their data for LTS and eliminate non-compliant files. Once the screening is complete, compliant folders can be copied to the LTS subdirectory for archiving.
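
The three criteria above can be expressed as a simple filter over scan results. The sketch below is an assumption-laden illustration, not the DSBU procedure: the record fields (path, older_than_10_years, keyword_hits, formats) and the ACCEPTED_FORMATS set are hypothetical, and the real list of formats suitable for archiving should come from the DSBU.

```python
# Hypothetical list; the formats actually accepted for LTS are defined by the DSBU.
ACCEPTED_FORMATS = {".csv", ".txt", ".pdf", ".tif"}


def compliance_issues(record):
    """Return the reasons a scanned folder is not ready for LTS (empty if none)."""
    issues = []
    if record["older_than_10_years"]:
        issues.append("older than 10 years")
    if record["keyword_hits"]:
        issues.append("possible personal data: " + ", ".join(record["keyword_hits"]))
    unexpected = sorted(set(record["formats"]) - ACCEPTED_FORMATS)
    if unexpected:
        issues.append("formats to review: " + ", ".join(unexpected))
    return issues


def split_by_compliance(records):
    """Separate folder paths into compliant ones and flagged ones with reasons."""
    compliant, flagged = [], {}
    for record in records:
        issues = compliance_issues(record)
        if issues:
            flagged[record["path"]] = issues
        else:
            compliant.append(record["path"])
    return compliant, flagged
```

Folders in the compliant list would be candidates for copying to the LTS subdirectory, while the flagged dictionary gives researchers a per-folder list of what to clean up first.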

Final Validation by DSBU for LTS

After researchers prepare their data for LTS, the DSBU performs a final control check to:

  • Verify folder compliance, naming conventions, and file formats.
  • Notify researchers of any remaining issues.
  • Where applicable, update retrospective README files generated by the DSBU’s DataSquid tool (link) with DeepScan’s findings (e.g., file formats), ensuring comprehensive and consistent documentation.

These final checks ensure that the archived data meets all requirements.

How to Use DeepScan

Currently, DeepScan is deployed by the DSBU within researchers’ NAS projects upon request. If you are planning to clean up your NAS or prepare data for long-term archiving, please contact the DSBU to schedule a screening for your projects.

We are working on an easy-to-use version of DeepScan that researchers can run themselves. Until then, deployment requires direct coordination with our team, which ensures accurate screening and tailored support for your archiving and data management needs.