Filetypes¶
The filetypes module provides two key functions in the Data Catalog system.
First, it enumerates over 700 distinct ways of storing structured information
in a file, with each format assigned a distinct human-readable string
identifier (i.e. RICHTEXT
or JSON
). Second, it provides a robust
mechanism for identifying the format of a file using a combination of its
name, binary signature, and in some cases, its contents.
Two methods of identification are currently implemented. The first is rules, which applies a collection of regular expressions to a filename to determine its type. This is very fast and doesn’t require physical access to the file. The second is mime, which uses the FreeDesktop.org MIME classification system for file typing. The mime method can inspect file contents, and thus may only be used where there’s a guarantee of file access.
Only extension of the rules mechanism is covered here.
Add a file type¶
Add an entry to FILETYPES
in datacatalog.filetypes.ruleset.py
following
this template: ('LOG', 'Log file', ['.err$', '.out$', '.log$'])
.
The fields are, in order:
- Label: This is the searchable ‘type’ in Data Catalog
- Description: Human-readable definition of the file type
- Patterns: List of one or more Python regular expressions for filename matching
Note that the rules are evaluated in order, with the first match being
returned. This is fast, but one must be aware of ordering conflicts when adding
new entries to FILETYPES
Test out the new rule, then open a pull request containing your improvements.
>>> filetypes.infer_filetype('captains.log')
AttrDict({'label': 'LOG', 'comment': 'Log file'})
>>> assert 'LOG' in filetypes.listall_labels()
>>> filetypes.validate_label('LOG')
True
Update a file type¶
Change the matching patterns in the select member of FILETYPES
. Test the
new behavior as outlined in the section above on adding a new type, then open
a pull request containing your improvements.