datacatalog.filetypes package

Submodules

datacatalog.filetypes.anytype module

datacatalog.filetypes.anytype.listall()

Get the “*” FileType

Returns: The “*” FileType, wrapped in a list
Return type: list

datacatalog.filetypes.filetype module

class datacatalog.filetypes.filetype.FileType(label, comment)

Bases: tuple

comment

Alias for field number 1

label

Alias for field number 0
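
Because FileType derives from tuple and exposes label and comment as aliases for fields 0 and 1, it should behave like a named tuple. The sketch below uses a stand-in built with collections.namedtuple rather than the class imported from datacatalog, so it only illustrates the documented access patterns; the real class may add behavior not shown here.

```python
from collections import namedtuple

# Stand-in mirroring the documented fields; not the library's own class
FileType = namedtuple('FileType', ['label', 'comment'])

bam = FileType(label='BAM', comment='Binary SAM')

# Fields are reachable by name or by position
by_name = (bam.label, bam.comment)
by_index = (bam[0], bam[1])

# Tuple unpacking works as for any 2-tuple
label, comment = bam
```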

class datacatalog.filetypes.filetype.FileTypeComment

Bases: str

Verbose human-readable name for a file type

exception datacatalog.filetypes.filetype.FileTypeError

Bases: ValueError

Error that occurs when working with FileTypes

class datacatalog.filetypes.filetype.FileTypeLabel

Bases: str

Short, human-readable label for a file type

datacatalog.filetypes.filetype.to_file_type(mimetype=None, label=None, comment=None)

datacatalog.filetypes.infer module

datacatalog.filetypes.infer.infer_filetype(filename, check_exists=False, permissive=True)

Infer a file’s canonical file type

Parameters:
  • filename (str) – An absolute or relative file path
  • check_exists (bool, optional) – Verify the file exists before classifying it
  • permissive (bool, optional) – Whether to return UNKNOWN instead of raising an Exception on failure

Note

Use of check_exists requires filename to be an absolute path

Raises:
  • OSError – Existence of the target file cannot be verified
  • FileTypeError – The target file’s type could not be inferred
Returns:

The type for the file

Return type:

FileType
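
The interaction between check_exists and permissive can be sketched as follows. This is not the library's implementation: FileType, FileTypeError, and UNKNOWN are local stand-ins, classify_by_extension is a hypothetical toy classifier, and the sketch assumes that permissive=True yields an UNKNOWN value while permissive=False lets FileTypeError propagate.

```python
import os
from collections import namedtuple

FileType = namedtuple('FileType', ['label', 'comment'])


class FileTypeError(ValueError):
    """Stand-in for datacatalog.filetypes.filetype.FileTypeError"""


# Hypothetical fallback value; the library's actual UNKNOWN object may differ
UNKNOWN = FileType('UNKNOWN', 'Unknown file type')


def classify_by_extension(filename):
    """Toy classifier standing in for the rule- and MIME-based inference"""
    if filename.endswith('.bam'):
        return FileType('BAM', 'Binary SAM')
    raise FileTypeError('no rule matched {}'.format(filename))


def infer_filetype_sketch(filename, check_exists=False, permissive=True):
    # Optionally refuse to classify a path that cannot be found
    if check_exists and not os.path.exists(filename):
        raise OSError('cannot verify existence of {}'.format(filename))
    try:
        return classify_by_extension(filename)
    except FileTypeError:
        # Permissive mode swallows the error and reports UNKNOWN
        if permissive:
            return UNKNOWN
        raise
```

Under these assumptions, callers that need a hard failure pass permissive=False and catch FileTypeError, while bulk indexing code can keep the default and filter out UNKNOWN results afterwards.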

datacatalog.filetypes.listing module

datacatalog.filetypes.listing.listall(filter_attrname=None)

Lists rule- and MIME-based types, labels, or comments

Parameters: filter_attrname (str, optional) – Attribute name to extract from list
Returns: A list of FileType, FileTypeLabel, or FileTypeComment objects
Return type: list
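
One way to picture filter_attrname is as an attribute lookup applied to each FileType in the list. The sketch below assumes the accepted names match the FileType fields (label and comment); listall_sketch and TYPES are hypothetical and defined locally rather than imported from datacatalog.

```python
from collections import namedtuple

FileType = namedtuple('FileType', ['label', 'comment'])

# Two entries borrowed from the documented ruleset
TYPES = [FileType('BAM', 'Binary SAM'), FileType('VCF', 'Variant Call Format')]


def listall_sketch(filter_attrname=None):
    """Return whole FileType objects, or one attribute from each"""
    if filter_attrname is None:
        return list(TYPES)
    return [getattr(t, filter_attrname) for t in TYPES]


everything = listall_sketch()
labels = listall_sketch('label')
comments = listall_sketch('comment')
```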

datacatalog.filetypes.listing.listall_comments()

Lists rule-based comments

Returns: A list of FileTypeComments
Return type: list

datacatalog.filetypes.listing.listall_labels()

Lists rule-based labels

Returns: A list of FileTypeLabels
Return type: list

datacatalog.filetypes.mime module

datacatalog.filetypes.mime.get_type_optimized(path, follow=False)

datacatalog.filetypes.mime.infer(filename)

Infer the FileType for a file by MIME classifier

Parameters: filename (str) – An absolute file path
Returns: What kind of file it is
Return type: FileType
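
As a rough illustration of MIME-based classification, the sketch below swaps in Python's standard mimetypes module, which guesses from the filename extension only and is not the FreeDesktop MIME database this module refers to. Using the upper-cased MIME subtype as the label is an assumption, loosely suggested by the TAB-SEPARATED-VALUES name mentioned in the ruleset; infer_by_mime_sketch is a hypothetical name.

```python
import mimetypes
from collections import namedtuple

FileType = namedtuple('FileType', ['label', 'comment'])


def infer_by_mime_sketch(filename):
    """Guess a MIME type from the filename and wrap it as a FileType"""
    mimetype, _encoding = mimetypes.guess_type(filename)
    if mimetype is None:
        return None
    # Assumed labeling scheme: upper-cased MIME subtype
    subtype = mimetype.split('/')[1]
    return FileType(subtype.upper(), mimetype)
```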

datacatalog.filetypes.mime.listall()

Get all FileTypes defined by the FreeDesktop MIME database

Returns: Multiple FileType objects
Return type: list

datacatalog.filetypes.rules module

datacatalog.filetypes.rules.infer(filename)

Infer the FileType for a file by rule

Parameters: filename (str) – An absolute or relative file path
Raises: FileTypeError – File did not match any of the rules
Returns: What kind of file it is
Return type: FileType

datacatalog.filetypes.rules.listall()

Get all FileTypes defined by rules

Returns: Multiple FileType objects
Return type: list

datacatalog.filetypes.ruleset module

datacatalog.filetypes.ruleset.FILETYPES = [
    ('BEDGRAPH', 'UCSC Genome Browser bedGraph format', ['.bedgraph$']),
    ('ABI', 'ABI Sequencer Chromatogram file', ['.ab1$', '.abi$', '.ab$', '.ab!$']),
    ('SCF', 'Standard Chromatogram Format file', ['.scf$']),
    ('LOG', 'Log file', ['.err$', '.out$', '.log$']),
    ('ENV', 'Environment file', ['.env$', '.rc$']),
    ('FASTQC', 'FASTQC outputs', ['fastqc.html$', 'fastqc.zip$']),
    ('MULITIQC', 'FASTQC outputs', ['multiqc_report.html$']),
    ('FASTA', 'FASTA sequence file', ['.fa$', '.fasta$', '.fa.gz$', '.fasta.gz$', '.fas$']),
    ('TSV', 'Tab-separated values (override TAB-SEPARATED-VALUES)', ['.tab$', '.tsv$']),
    ('BAM', 'Binary SAM', ['.bam$']),
    ('BAI', 'Binary SAM Index', ['.bai$']),
    ('VCF', 'Variant Call Format', ['.vcf$']),
    ('BCF', 'Binary Variant Call Format', ['.bcf$']),
    ('MD5', 'MD5 checksum file', ['.md5$']),
    ('SAM', 'Sequence Alignment/MAP', ['.sam$']),
    ('FASTQ', 'FASTQ sequence file', ['.fastq$', '.fastq.gz$', '.fq$', '.fq.gz$']),
    ('FCS', 'Flow Cytometry Standard', ['.fcs$']),
    ('SRAW', 'Raw proteomics file', ['.sraw$']),
    ('MZML', 'Proteomics mzML file', ['.mzML$']),
    ('MSF', 'Magellan storage file', ['.msf$']),
    ('SAMPLES', 'Sample Set Metadata (JSON)', ['^metadata-[a-z0-9-]+.json$']),
    ('BPROV', 'Biofab Provenance (JSON)', ['^provenance_dump.json$']),
    ('INI', 'INI config file', ['.ini$']),
    ('SECRETS', 'Abaco secrets file', ['^secrets.json$']),
    ('CONFIG', 'Configuration file', ['config.rc$', 'reactor.rc$', 'config.yml$']),
    ('GIT', 'Git file', ['.git']),
    ('JENKINS', 'Jenkins Pipeline file', ['^Jenkinsfile$']),
    ('DOCKERFILE', 'Docker build file', ['^Dockerfile$']),
    ('REQUIREMENTS', 'Python requirements file', ['^requirements.txt$']),
    ('COMPOSEFILE', 'Docker compose file', ['^docker-compose.yml$']),
    ('GFF3', 'Sequence Ontology General Feature Format', ['.gff$', '.gff3$']),
    ('GTF', 'Ensembl Gene Transfer Format', ['.gtf$']),
    ('AB1', 'ABI Sequencer Chromatogram file', ['.ab1$'])
    ,('JPEG', 'Alias for JPEG file', ['.jpg$'])
]

A list of tuples defining classification rules for filenames. Each tuple holds a label, a comment, and a list of filename patterns
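
The rule tuples can be read as (label, comment, patterns). The sketch below shows one plausible way to apply them; it assumes the patterns are regular expressions searched against the file's basename and that the first matching rule wins. match_rules and RULES are hypothetical names, and the library's actual matching logic may differ.

```python
import os
import re

# Subset of the documented FILETYPES entries, copied verbatim
RULES = [
    ('FASTA', 'FASTA sequence file', ['.fa$', '.fasta$', '.fa.gz$', '.fasta.gz$', '.fas$']),
    ('BAM', 'Binary SAM', ['.bam$']),
    ('FASTQ', 'FASTQ sequence file', ['.fastq$', '.fastq.gz$', '.fq$', '.fq.gz$']),
    ('DOCKERFILE', 'Docker build file', ['^Dockerfile$']),
]


def match_rules(filename, rules=RULES):
    """Return (label, comment) for the first rule whose pattern matches"""
    basename = os.path.basename(filename)
    for label, comment, patterns in rules:
        for pattern in patterns:
            if re.search(pattern, basename):
                return label, comment
    return None
```

If the patterns are indeed regular expressions, the leading "." in entries such as '.bam$' matches any character rather than a literal dot, so a name like "foobam" would also match.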

datacatalog.filetypes.schemas module

class datacatalog.filetypes.schemas.FileTypeLabelDoc(**kwargs)

Bases: datacatalog.jsonschemas.schema.JSONSchemaBaseObject

Schema document enumerating all FileTypeLabels

datacatalog.filetypes.schemas.get_schemas()

Returns the filetype_label subschema

Returns: One or more schema documents
Return type: JSONSchemaCollection

datacatalog.filetypes.unknown module

datacatalog.filetypes.unknown.listall()

Get the Unknown FileType

Returns: The Unknown FileType, wrapped in a list
Return type: list

datacatalog.filetypes.validate module

datacatalog.filetypes.validate.validate_label(label, permissive=True)

Verify a string label is found in the known set of FileTypeLabels

Parameters:
  • label (str) – A value to check
  • permissive (bool, optional) – Whether to return False instead of raising an Exception on failure
Returns:

Whether label is valid

Return type:

bool
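
A minimal sketch of permissive validation follows. It assumes that permissive=True reports an unknown label by returning False, and that permissive=False raises instead; the exception type (ValueError here), the KNOWN_LABELS set, and the name validate_label_sketch are all illustrative choices, not taken from the library.

```python
# Hypothetical subset of known labels, taken from the documented ruleset
KNOWN_LABELS = {'BAM', 'VCF', 'FASTQ'}


def validate_label_sketch(label, permissive=True):
    """Check membership; fail softly or loudly depending on permissive"""
    if label in KNOWN_LABELS:
        return True
    if permissive:
        return False
    # Assumed exception type; the library may raise something else
    raise ValueError('{} is not a known FileTypeLabel'.format(label))
```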