tools

dimspy.tools.process_scans(source: str, function_noise: str, snr_thres: float, ppm: float, min_fraction: Optional[float] = None, rsd_thres: Optional[float] = None, min_scans: int = 1, filelist: Optional[str] = None, skip_stitching: bool = False, remove_mz_range: list = None, ringing_thres: Optional[float] = None, filter_scan_events: Dict = None, report: Optional[str] = None, block_size: int = 5000, ncpus: int = None)

Extract, filter and average spectral data from input .RAW or .mzML files and generate a single mass spectral peaklist (object) for each of the data files within a directory or defined in the ‘filelist’ (if provided).

Warning

When using .mzML files generated using the ProteoWizard tool, SIM-type scans will only be treated as spectra if the ‘simAsSpectra’ filter was set to true during the conversion process: msconvert.exe example.raw --simAsSpectra --64 --zlib --filter "peakPicking true 1-"

Parameters
  • source – Path to a set/directory of .raw or .mzML files

  • function_noise

    Function to calculate the noise from each scan. The following options are available:

    • median - the median of all peak intensities within a given scan is used as the noise value.

    • mean - the unweighted mean average of all peak intensities within a given scan is used as the noise value.

    • mad (Mean Absolute Deviation) - the noise value is set as the mean of the absolute differences between peak intensities and the mean peak intensity (calculated across all peak intensities within a given scan).

    • noise_packets - the noise value is calculated using the proprietary algorithms contained in Thermo Fisher Scientific’s msFileReader library. This option should only be applied when you are processing .RAW files.

  • snr_thres – Peaks with a signal-to-noise ratio (SNR) less-than or equal-to this value will be removed from the output peaklist.

  • ppm – Maximum tolerated m/z deviation in parts per million.

  • min_fraction – A numerical value from 0 to 1 that specifies the minimum proportion of scans a given mass spectral peak must be detected in, in order for it to be kept in the output peaklist. Here, scans refers to replicates of the same scan event type, i.e. if set to 0.33, then a peak would need to be detected in at least 1 of the 3 replicates of a given scan event type.

  • rsd_thres – Relative standard deviation threshold - A numerical value equal-to or greater-than 0. If greater than 0, then peaks whose intensity values have a percent relative standard deviation (otherwise termed the percent coefficient of variation) greater-than this value are excluded from the output peaklist.

  • min_scans – Minimum number of scans required for each m/z window or event within a raw/mzML data file.

  • filelist

    A tab-delimited text file containing filename and classLabel information for each experimental sample. These column headers MUST be included in the first row of the table. For a standard DIMS experiment, users are advised to also include the following additional columns:

    • injectionOrder - integer values ranging from 1 to i, where i is the total number of independent injections performed as part of a DIMS experiment. e.g. if a study included 20 samples, each of which was injected as four independent replicates, there would be at least 20 * 4 injections, so i = 80 and the range for injection order would be from 1 to 80 in steps of 1.

    • replicate - integer value from 1 to r, indicating the order in which technical replicates of each study sample were injected in to the mass spectrometer, e.g. if study samples were analysed in quadruplicate, r = 4 and integer values are accordingly 1, 2, 3, 4.

    • batch - integer value from 1 to b, where b corresponds to the total number of batches analysed under defined analysis conditions, for any given experiment, e.g. if 4 independent plates of polar extracts were analysed in the positive ionisation mode, then valid values for batch are 1, 2, 3 and 4.

    This filelist may include additional columns, e.g. additional metadata relating to study samples. Ensure that column names do not conflict with existing column names.

  • skip_stitching – Selected Ion Monitoring (SIM) scans with overlapping scan ranges can be “stitched” together in to a pseudo-spectrum. This is achieved by setting this parameter to False (default).

  • remove_mz_range – This option allows specific m/z regions of the output peaklist to be deleted. It may be useful for removing sections of a spectrum known to correspond to system noise peaks.

  • ringing_thres – Fourier transform-based mass spectra often contain peaks (ringing artefacts) around spectral features that require removal. This threshold is a positive float indicating the required relative intensity a peak must exceed (with reference to the largest peak in a cluster of peaks) in order to be retained.

  • filter_scan_events

    Include or exclude specific scan events. By default ALL scan events are included. To include or exclude specific scan events, use a dictionary of the following form.

    >>> {"include":[[100, 300, "sim"]]} or {"include":[[100, 1000, "full"]]}
    

  • report – A tab-delimited text file to write measures of quality (e.g. RSD, number of peaks, etc) for each scan event processed in each .RAW or .mzML file.

  • block_size – Number of peaks in each centre clustering block.

  • ncpus – Number of CPUs for parallel clustering. Default = None, which uses all available CPUs.
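
The median, mean and mad noise options, together with the snr_thres rule, can be illustrated with a small standalone sketch. This is not DIMSpy's implementation: the helpers estimate_noise and filter_by_snr are hypothetical, the intensities are invented, and the noise_packets option is left out because it depends on Thermo's library.

```python
from statistics import mean, median


def estimate_noise(intensities, function_noise="median"):
    """Return one noise value for a single scan (hypothetical helper)."""
    if function_noise == "median":
        return median(intensities)
    if function_noise == "mean":
        return mean(intensities)
    if function_noise == "mad":
        # Mean of the absolute differences from the mean intensity.
        centre = mean(intensities)
        return mean(abs(i - centre) for i in intensities)
    raise ValueError(f"unsupported noise function: {function_noise}")


def filter_by_snr(intensities, snr_thres, function_noise="median"):
    """Keep peaks whose SNR is strictly greater than snr_thres."""
    noise = estimate_noise(intensities, function_noise)
    return [i for i in intensities if i / noise > snr_thres]


scan = [10.0, 12.0, 11.0, 9.0, 200.0]  # illustrative intensities only
kept = filter_by_snr(scan, snr_thres=3.0, function_noise="median")
```

With these values the median noise is 11.0, so only the peak at 200.0 has an SNR above 3.0 and is kept.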

Returns

List of peaklist objects
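
A filelist of the kind described for the filelist parameter can be written with Python's csv module. The sketch below is illustrative only: the sample names and class labels are invented, and write_filelist is a hypothetical helper, not part of DIMSpy.

```python
import csv
import os
import tempfile

# Required headers first, then the columns advised for a standard DIMS experiment.
COLUMNS = ["filename", "classLabel", "injectionOrder", "replicate", "batch"]


def write_filelist(path, rows):
    """Write a tab-delimited filelist with the headers in the first row."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


# Two hypothetical samples, each injected as two technical replicates.
rows = []
order = 0
for sample in ("sample_A", "sample_B"):
    for replicate in (1, 2):
        order += 1
        rows.append({
            "filename": f"{sample}_rep{replicate}.mzML",
            "classLabel": "sample",
            "injectionOrder": order,
            "replicate": replicate,
            "batch": 1,
        })

filelist_path = os.path.join(tempfile.mkdtemp(), "filelist.txt")
write_filelist(filelist_path, rows)

# Read the header back to confirm the first row holds the column names.
with open(filelist_path, newline="") as handle:
    header = handle.readline().rstrip("\r\n").split("\t")
```

The resulting path could then be passed as the filelist argument.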

dimspy.tools.replicate_filter(source: Union[Sequence[dimspy.models.peaklist.PeakList], str], ppm: float, replicates: int, min_peaks: int, rsd_thres: Optional[float] = None, filelist: Optional[str] = None, report: Optional[str] = None, block_size: int = 5000, ncpus: int = None)

Peaks from each technical replicate (for a given study sample) are aligned using a one-dimensional hierarchical clustering procedure (applied on the mass-to-charge level). Peaks are aligned only if the difference in their mass-to-charge ratios, when divided by the average of their mass-to-charge ratios and multiplied by 1 × 10^6 (i.e. when measured in units of parts-per-million, ppm), is less-than or equal-to the user-defined ‘ppm error tolerance’. After alignment, a set of user-defined filters are applied to retain only those peaks that:

  • occur in equal-to or more-than the user-defined ‘Number of technical replicates a peak has to be present in’, i.e. if set to 2, then a peak must be detected in at least two of the replicate analyses, and/or

  • have relative standard deviation (measured in %; may otherwise be referred to as the percent coefficient of variation) of intensity values, across technical replicates, that is equal-to or less-than the user-defined ‘relative standard deviation threshold’ (if defined, otherwise ignored).
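
The ppm criterion and the two filters above can be sketched in a few lines. This is an illustration rather than DIMSpy's code: the helper names are hypothetical, the clustering step itself is not shown, and the use of the sample standard deviation for the RSD is an assumption.

```python
from statistics import mean, stdev


def ppm_difference(mz_a, mz_b):
    """Difference between two m/z values in parts per million."""
    return abs(mz_a - mz_b) / mean([mz_a, mz_b]) * 1e6


def percent_rsd(intensities):
    """Percent relative standard deviation (assumes sample standard deviation)."""
    return stdev(intensities) / mean(intensities) * 100.0


def keep_aligned_peak(intensities, min_peaks, rsd_thres=None):
    """Apply both filters to one aligned peak.

    `intensities` holds one value per replicate in which the peak was detected.
    """
    if len(intensities) < min_peaks:
        return False
    if rsd_thres is not None and len(intensities) > 1:
        return percent_rsd(intensities) <= rsd_thres
    return True


# Two peaks about 5 ppm apart would align under a 10 ppm tolerance.
aligned = ppm_difference(200.000, 200.001) <= 10.0
# Detected in 3 replicates with 5% RSD: retained for min_peaks=2, rsd_thres=10.
retained = keep_aligned_peak([100.0, 105.0, 95.0], min_peaks=2, rsd_thres=10.0)
```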

Warning

When the parameter “number of technical replicates for each sample” is set to a value less-than the total number of technical replicates actually acquired for each study sample, this tool will automatically determine which combination of technical replicates to combine. See the parameter description (below) for further details.

Parameters
  • source – A list of processed peaklist objects generated by ‘process_scans’ or path to .hdf5 file

  • ppm – Maximum tolerated m/z deviation in parts per million.

  • replicates – Number of technical replicates for each sample - the total number of technical replicates acquired for each study sample. This value must be set to the lowest number of technical replicates acquired for ANY of the study samples, or alternatively, may be set to the minimum number of replicates the user would like to select from the total number of technical replicates for a biological sample.

  • min_peaks

    Minimum number of technical replicates a peak has to be present in. For a given biological sample, the number of replicates that will be used to generate the replicate-filtered peaklist. If this parameter is set to a value less-than the total number of technical replicates acquired for each biological sample, the tool automatically determines which combination of technical replicates yields the best overall rank. Otherwise, all technical replicates are used. Ranking of the combinations of technical replicates is based on the average of the following three scores:

    • score 1: peak count / peak count present in n-out-of-n (e.g. 3-out-of-3)

    • score 2: peak count present in x-out-of-n (e.g. 3-out-of-3) / MAX peak count present in x-out-of-n across sets of replicates

    • score 3: RSD categories (0-5 (score=1.0), 5-10 (score=0.9), 10-15 (score=0.8), etc)

  • rsd_thres – Relative standard deviation threshold - a numerical value from 0 upwards that defines the acceptable percentage relative standard deviation (otherwise termed the percent coefficient of variation) of a peak’s intensity across technical replicates. Peaks are removed from the output ‘replicate-filtered’ peaklist if this condition is not met. Set to None to skip this filter.

  • filelist

    A tab-delimited text file containing filename and classLabel information for each experimental sample. There is no need to provide a filelist again if this has been done already as part of one of the previous processing steps (i.e. see process scans or replicate filter) - except if specific samples need to be excluded. These column headers MUST be included in the first row of the table. For a standard DIMS experiment, users are advised to also include the following additional columns:

    • injectionOrder - integer values ranging from 1 to i, where i is the total number of independent injections performed as part of a DIMS experiment. e.g. if a study included 20 samples, each of which was injected as four independent replicates, there would be at least 20 * 4 injections, so i = 80 and the range for injection order would be from 1 to 80 in steps of 1.

    • replicate - integer value from 1 to r, indicating the order in which technical replicates of each study sample were injected in to the mass spectrometer, e.g. if study samples were analysed in quadruplicate, r = 4 and integer values are accordingly 1, 2, 3, 4.

    • batch - integer value from 1 to b, where b corresponds to the total number of batches analysed under defined analysis conditions, for any given experiment, e.g. if 4 independent plates of polar extracts were analysed in the positive ionisation mode, then valid values for batch are 1, 2, 3 and 4.

    This filelist may include additional columns, e.g. additional metadata relating to study samples. Ensure that column names do not conflict with existing column names.

  • report – A tab-delimited text file to write measures of quality (e.g. RSD, number of peaks, etc) for each processed ‘replicate-filtered’ peaklist.

  • block_size – Number of peaks in each centre clustering block.

  • ncpus – Number of CPUs for parallel clustering. Default = None, which uses all available CPUs.

Returns

List of peaklist objects
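
How candidate combinations of technical replicates might be enumerated and ranked can be sketched as follows. This is a rough illustration under stated assumptions, not the DIMSpy algorithm: the helper names are hypothetical, a category boundary (e.g. exactly 5%) is assumed to fall into the upper category, and the RSD score is assumed to bottom out at 0.0. Scores 1 and 2 are taken as given inputs.

```python
from itertools import combinations


def rsd_category_score(rsd):
    """Map a percent RSD to a score: 0-5 -> 1.0, 5-10 -> 0.9, 10-15 -> 0.8, ..."""
    # Assumption: boundaries belong to the upper category; floor at 0.0.
    return round(max(0.0, 1.0 - 0.1 * (rsd // 5)), 1)


def overall_rank(score_1, score_2, score_3):
    """Average of the three scores used to rank one combination."""
    return (score_1 + score_2 + score_3) / 3.0


acquired = [1, 2, 3, 4]  # replicate numbers acquired for one sample
replicates = 3           # number of replicates to select
candidates = list(combinations(acquired, replicates))
```

Here four acquired replicates yield four candidate sets of three, each of which would receive its own overall rank.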

dimspy.tools.align_samples(source: Union[Sequence[dimspy.models.peaklist.PeakList], str], ppm: float, filelist: Optional[str] = None, block_size: int = 5000, ncpus: int = None)

Study samples (i.e. PeakList objects) are aligned to create a PeakMatrix object. The PeakMatrix object comprises a table, with samples along one axis and the mass-to-charge ratios of detected mass spectral peaks along the opposite axis. At the intersection of sample and mass-to-charge ratio, the intensity is given for a specific peak in a specific sample (if no intensity was recorded, then ‘nan’ is inserted).
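
The layout of such a table can be pictured with plain Python. The sketch assumes the m/z values have already been aligned to common values (the clustering step is skipped), and the sample names and intensities are invented.

```python
import math

# Hypothetical, already-aligned peaklists: sample name -> {m/z: intensity}.
peaklists = {
    "sample_1": {100.0: 1500.0, 200.0: 320.0},
    "sample_2": {100.0: 1420.0, 300.0: 90.0},
}

mz_axis = sorted({mz for peaks in peaklists.values() for mz in peaks})
sample_axis = list(peaklists)

# One row per sample, one column per m/z value; nan marks "no intensity recorded".
matrix = [
    [peaklists[sample].get(mz, math.nan) for mz in mz_axis]
    for sample in sample_axis
]
```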

Parameters
  • source – A list of processed peaklist objects generated by ‘process_scans’ and/or ‘replicate_filter’, or path to .hdf5 file.

  • ppm – Maximum tolerated m/z deviation in parts per million.

  • filelist

    A tab-delimited text file containing filename and classLabel information for each experimental sample. There is no need to provide a filelist again if this has been done already as part of one of the previous processing steps (i.e. see process scans or replicate filter) - except if specific samples need to be excluded. These column headers MUST be included in the first row of the table.

    This filelist may include additional columns, e.g. additional metadata relating to study samples. Ensure that column names do not conflict with existing column names.

  • block_size – Number of peaks in each centre clustering block.

  • ncpus – Number of CPUs for parallel clustering. Default = None, which uses all available CPUs.

Returns

PeakMatrix object

dimspy.tools.blank_filter(peak_matrix: Union[dimspy.models.peak_matrix.PeakMatrix, str], blank_label: str, min_fraction: float = 1.0, min_fold_change: float = 1.0, function: str = 'mean', rm_samples: bool = True, labels: Optional[str] = None)

Parameters
  • peak_matrix – PeakMatrix object

  • blank_label – Label for the blank samples - a string indicating the name of the class to be used for filtering (e.g. blank), i.e. the “reference” class. This string must have been included in the “classLabel” column of the metadata file associated with the process_scans or replicate_filter function(s).

  • min_fraction – A numeric value ranging from 0 to 1. Setting this value to None or 0 will skip this filtering step. A value greater than 0 requires that for each peak in the peak intensity matrix, at least this proportion of non-reference samples have to have an intensity value that exceeds the product of: (A) the average intensity of “reference” class intensities and (B) the user-defined “min_fold_change”. If this condition is not met, the peak is removed from the peak intensity matrix.

  • min_fold_change – A numeric value from 0 upwards. When minimum fraction filtering is enabled, this value defines the minimum required ratio between the intensity of a peak in a “non-reference” sample and the average intensity of the “reference” sample(s). Peaks with ratios exceeding this threshold are considered to have been reliably detected in a “non-reference” sample.

  • function

    Function to calculate the ‘reference’ intensity

    • mean - corresponds to using the non-weighted average of “reference” sample peak intensities (NA values are ignored) in calculating the “reference” to “non-reference” peak intensity ratio.

    • median - corresponds to using the median of “reference” sample peak intensities (NA values are ignored) in calculating the “reference” to “non-reference” peak intensity ratio.

    • max - corresponds to the use of the maximum intensity among “reference” sample peak intensities (NA values are ignored) in calculating the “reference” to “non-reference” peak intensity ratio.

  • rm_samples

    Remove blank samples from the output peak matrix:

    • True - samples belonging to the user-defined “reference” class are removed from the output peak matrix.

    • False - samples belonging to the user-defined “reference” class are retained in the output peak matrix.

  • labels – Path to the metadata file

Returns

PeakMatrix object
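
The interplay of min_fraction, min_fold_change and function can be sketched for a single peak. This is an illustration, not the DIMSpy implementation: passes_blank_filter is a hypothetical helper, the fraction is assumed to be taken over all non-reference samples, and a peak with no recorded blank intensity is assumed to pass.

```python
import math
from statistics import mean, median

REFERENCE_FUNCTIONS = {"mean": mean, "median": median, "max": max}


def passes_blank_filter(sample_values, blank_values, min_fraction=1.0,
                        min_fold_change=1.0, function="mean"):
    """Decide whether one peak survives the blank filter (illustrative only)."""
    if not min_fraction:
        return True  # None or 0 skips the filtering step
    blanks = [v for v in blank_values if not math.isnan(v)]
    if not blanks:
        # Assumption: a peak never seen in the blanks cannot fail the filter.
        return True
    threshold = REFERENCE_FUNCTIONS[function](blanks) * min_fold_change
    # nan > threshold is False, so missing values never count as exceeding.
    exceeding = sum(1 for v in sample_values if v > threshold)
    return exceeding / len(sample_values) >= min_fraction


samples = [100.0, 120.0, math.nan]  # invented non-reference intensities
blanks = [10.0, math.nan, 30.0]     # invented reference intensities (mean = 20)
strict = passes_blank_filter(samples, blanks, min_fraction=0.5, min_fold_change=5.0)
lenient = passes_blank_filter(samples, blanks, min_fraction=0.5, min_fold_change=4.0)
```

With a fold change of 5 the threshold is 100 and only one of three samples exceeds it, so the peak is removed; with a fold change of 4 the threshold drops to 80 and two of three samples exceed it, so the peak is kept.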

dimspy.tools.sample_filter(peak_matrix: Union[dimspy.models.peak_matrix.PeakMatrix, str], min_fraction: float, within: bool = False, rsd_thres: Optional[float] = None, qc_label: Optional[str] = None, labels: Optional[str] = None)

Removes peaks from the input PeakMatrix object (or .hdf5 file) that were detected in fewer-than a user-defined minimum fraction of study samples.

There are many and varied reasons why a peak may not have been detected in all study samples, including:
  • due to having an intensity (concentration) close to the signal-to-noise limit of the system;

  • due to having been present in only one of the study classes (e.g. a drug administered to the ‘treatment’ class samples);

  • due to ion suppression/enhancement effects in the mass spectrometer source region; etc.

Parameters
  • peak_matrix – PeakMatrix object or path to .hdf5 file

  • min_fraction – Minimum fraction - a numeric value between 0 and 1 indicating the proportion of study samples in which a peak must have a recorded intensity value in order for it to be retained in the output peak intensity matrix; e.g. 0.5 means that at least 50% of samples (whether assessed across all classes, or within each class individually) must have a recorded intensity value for a specific peak in order for it to be retained in the output peak matrix.

  • within

    Apply sample filter within each sample class

    • False - check across ALL classes simultaneously whether greater-than the user-defined “Minimum fraction” of samples contained an intensity value for a specific mass spectral peak.

    • True - check within EACH class separately whether greater-than the user-defined “Minimum fraction” of samples contained an intensity value for a specific mass spectral peak.

    Warning

    If in ANY class a peak is detected in greater-than the user-defined minimum fraction of samples, then the peak is retained in the output peak matrix. For classes in which this condition is not met, the peak intensity recorded for that peak (if any) will still be presented in the output peak matrix. If no peak intensity was recorded in a sample, then a ‘0’ is inserted in to the peak matrix.

  • rsd_thres – Relative standard deviation threshold - A numerical value equal-to or greater-than 0. If greater than 0, then peaks whose intensity values have a percent relative standard deviation (otherwise termed the percent coefficient of variation) greater-than this value are excluded from the output PeakMatrix object.

  • qc_label – Label for the QC samples - a string indicating the name of the class to be used for filtering, i.e. the “reference” class. This string must have been included in the “classLabel” column of the metadata file associated with the process_scans or replicate_filter function(s).

  • labels – Path to a metadata file

Returns

PeakMatrix object
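
The difference between filtering across all classes and within each class can be sketched for a single peak. The helper names are hypothetical and the data invented; the sketch treats min_fraction as an "at least" threshold, which is an assumption given that the parameter descriptions use both "at least" and "greater-than".

```python
import math


def detected_fraction(values):
    """Fraction of samples with a recorded (non-nan) intensity."""
    return sum(not math.isnan(v) for v in values) / len(values)


def passes_sample_filter(values, class_labels, min_fraction, within=False):
    """Apply the min_fraction rule across all samples or within each class."""
    if not within:
        return detected_fraction(values) >= min_fraction
    by_class = {}
    for value, label in zip(values, class_labels):
        by_class.setdefault(label, []).append(value)
    # The peak is retained if the condition is met in ANY class.
    return any(detected_fraction(v) >= min_fraction for v in by_class.values())


# A peak seen only in the 'treatment' class (e.g. an administered drug).
values = [math.nan, math.nan, math.nan, 50.0, 60.0, 55.0]
labels = ["control"] * 3 + ["treatment"] * 3
across_all = passes_sample_filter(values, labels, min_fraction=0.8)
within_class = passes_sample_filter(values, labels, min_fraction=0.8, within=True)
```

Across all samples the peak is detected in only 50% of samples and is removed; within classes it is detected in 100% of the treatment samples and is retained.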

dimspy.tools.missing_values_sample_filter(peak_matrix: dimspy.models.peak_matrix.PeakMatrix, max_fraction: float)

Removes study samples with greater-than a user-defined “Maximum percentage of missing values” from the peak intensity matrix. A missing value is defined as the absence of a recorded peak intensity value for a specific mass spectral peak, in a specific study sample.

Samples with large numbers of missing values are often observed where a failed mass spectral acquisition has occurred, the reasons for which are many and diverse.

Parameters
  • peak_matrix – PeakMatrix object

  • max_fraction

    Maximum percentage of missing values (REQUIRED; default = 0.8) - a numeric value ranging from 0 to 1 (decimal representation of percentage), where:

    • A value of 0 (i.e. 0%) corresponds to a very harsh filtering procedure, in which only those samples with zero missing values are retained in the output peak matrix.

    • A value of 1 (i.e. 100%) corresponds to a very liberal filtering procedure, in which samples with as many as 100% missing values will be retained in the output peak matrix.

Returns

PeakMatrix object
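
The effect of max_fraction can be sketched with a plain dictionary standing in for the peak matrix. The helper filter_samples and the sample data are hypothetical; nan is used here to mark a missing value.

```python
import math


def filter_samples(matrix, max_fraction):
    """Keep samples whose fraction of missing (nan) values is <= max_fraction."""
    kept = {}
    for sample, intensities in matrix.items():
        missing = sum(math.isnan(v) for v in intensities) / len(intensities)
        if missing <= max_fraction:
            kept[sample] = intensities
    return kept


nan = math.nan
matrix = {
    "good": [1.0, 2.0, 3.0, 4.0],     # 0% missing
    "partial": [1.0, nan, 3.0, nan],  # 50% missing
    "failed": [nan, nan, nan, nan],   # 100% missing
}

default_like = filter_samples(matrix, max_fraction=0.8)
harsh = filter_samples(matrix, max_fraction=0.0)
liberal = filter_samples(matrix, max_fraction=1.0)
```

A value of 0.8 drops only the failed acquisition, 0 keeps only the complete sample, and 1 keeps everything.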

dimspy.tools.remove_samples(obj: Union[dimspy.models.peak_matrix.PeakMatrix, Sequence[dimspy.models.peaklist.PeakList]], sample_names: list)

Remove samples from a PeakMatrix or list of PeakLists

Parameters
  • obj – PeakMatrix object or List of PeakList objects

  • sample_names – List of sample names (Peaklist IDs)

Returns

PeakMatrix object or List of Peaklist Objects

dimspy.tools.hdf5_peak_matrix_to_txt(filename: str, path_out: str, attr_name: str = 'intensity', rsd_tags: tuple = (), delimiter: str = '\t', samples_in_rows: bool = True, comprehensive: bool = False, compatibility_mode: bool = False)

Converts a .hdf5 file, containing a peak intensity matrix, to a user-friendly .tsv (tab-separated values) file.

Parameters
  • filename – Path to the .hdf5 file to read from.

  • path_out – Path to a text file to write to.

  • attr_name – The Peak Matrix should contain Intensity|m/z|SNR| values

  • rsd_tags – Calculate RSD values for the following sample classes (e.g. QC, control)

  • delimiter – Values on each line of the file are separated by this character.

  • samples_in_rows – Should the rows or columns represent the samples?

  • comprehensive – Comprehensive Peak Matrix (e.g. m/z and intensity, rsd, missing values).

  • compatibility_mode – Set to True to read .hdf5 files exported using dimspy < v2.0.

dimspy.tools.hdf5_peaklists_to_txt(filename: str, path_out: str, delimiter: str = '\t', compatibility_mode: bool = False)

Converts a .hdf5 file, containing a list of peaklists, to user-friendly .tsv (tab-separated values) files.

Parameters
  • filename – Path to the .hdf5 file to read from.

  • path_out – Path to directory to write to.

  • delimiter – Values on each line of the file are separated by this character.

  • compatibility_mode – Set to True to read .hdf5 files exported using dimspy < v2.0.

dimspy.tools.merge_peaklists(source: Sequence[dimspy.models.peaklist.PeakList], filelist: Optional[str] = None)

Extracts and exports specific PeakList objects from one or more lists or one or more .hdf5 files, to one or more lists or .hdf5 files. If more-than one .hdf5 file is exported, users can control which subset of peaklists is exported to which list.

Parameters
  • source – List or tuple of Peaklist objects, or .hdf5 files

  • filelist

    A tab-delimited text file containing metadata to determine which peaklists are exported together:

    Example of a filelist - the optional multilist column determines which peaklists are exported together.

    filename         classLabel  replicate  batch  injectionOrder  multilist  […]
    sample_rep1.raw  sample      1          1      1               1          […]
    sample_rep2.raw  sample      2          1      2               1          […]
    sample_rep3.raw  sample      3          1      3               1          […]
    sample_rep4.raw  sample      4          1      4               1          […]
    blank_rep1.raw   blank       1          1      5               2          […]
    blank_rep2.raw   blank       2          1      6               2          […]
    blank_rep3.raw   blank       3          1      7               2          […]
    blank_rep4.raw   blank       4          1      8               2          […]
    […]

Returns

Nested lists of Peaklist objects (e.g. [[pl_01, pl_02], [pl_03, pl_04, pl_05]])
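
How the multilist column could translate into nested lists can be sketched with plain Python. Here file names stand in for PeakList objects, and group_by_multilist is a hypothetical helper rather than the function's actual implementation.

```python
from collections import defaultdict

# (filename, multilist) pairs taken from the example filelist for this function.
entries = [
    ("sample_rep1.raw", 1), ("sample_rep2.raw", 1),
    ("sample_rep3.raw", 1), ("sample_rep4.raw", 1),
    ("blank_rep1.raw", 2), ("blank_rep2.raw", 2),
    ("blank_rep3.raw", 2), ("blank_rep4.raw", 2),
]


def group_by_multilist(entries):
    """Group items that share a multilist value into nested lists."""
    groups = defaultdict(list)
    for filename, multilist in entries:
        groups[multilist].append(filename)
    # Return the groups ordered by their multilist value.
    return [groups[key] for key in sorted(groups)]


nested = group_by_multilist(entries)
```

The result holds two sublists: the four sample replicates, followed by the four blank replicates.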

dimspy.tools.partition(alist: list, indices: list)

Divide separated lists into nested sublists

Parameters
  • alist – List

  • indices – Indices

Returns

Nested List

dimspy.tools.load_peaklists(source: Sequence[dimspy.models.peaklist.PeakList])

Load a set of processed PeakLists

Parameters

source – list of Peaklist objects, .hdf5 file, or path to a directory

Returns

List of Peaklist Objects

dimspy.tools.create_sample_list(source: Union[Sequence[dimspy.models.peaklist.PeakList], dimspy.models.peak_matrix.PeakMatrix], path_out: str, delimiter: str = '\t')

Create a sample list based on an existing list of PeakList objects or a PeakMatrix object.

Parameters
  • source – List of PeakList objects or PeakMatrix object

  • path_out – Path to a text file to write to.

  • delimiter – Values on each line of the file are separated by this character.