Refine a List of Protein Complexes by Size and Redundancy
Source:R/refineComplexList.R
refineComplexList.RdThis function filters complexes by size and merges highly redundant ones.
Usage
refineComplexList(
complexList,
minSize = 3,
maxSize = 500,
mergeThreshold = 0.9,
similarityMethod = "jaccard",
verbose = TRUE
)Arguments
- complexList
A named list where each element is a character vector of protein identifiers representing a complex.
- minSize
Minimum complex size. Defaults to 3.
- maxSize
Maximum complex size. Defaults to 500.
- mergeThreshold
Numeric (0-1). Threshold for merging. Defaults to **0.90** (strict) to preserve distinct variants.
- similarityMethod
Metric for merging. Defaults to **"jaccard"**. Options: * `"jaccard"`: (Default) `Intersection / Union`. Penalizes size differences. Best for preserving specific variants. * `"overlap"`: `Intersection / Min(A,B)`. Merges subsets into parents. Use only if you want to aggressively reduce data. * `"matching_score"`: `Intersection^2 / (SizeA * SizeB)`. Geometric approach. * `"dice"`: `2 * Intersection / (SizeA + SizeB)`.
- verbose
Logical.
Value
A list containing: * `refinedComplexes`: The final list of merged complexes. * `mergeMap`: A tibble mapping original IDs to final IDs.
Details
**Systems Biology Rationale:** To preserve the functional diversity of the input landscape, this function defaults to a conservative **Jaccard** similarity with a high threshold (0.90).
* **Trust the Input:** We assume the upstream clustering method identified biologically relevant variants (e.g., a complex with vs. without a regulatory subunit). * **Minimal Merging:** We only merge complexes if they are effectively synonyms (nearly identical composition).
We discourage using "overlap" or "matching_score" for this step, as they tend to merge sub-complexes into parents, destroying the specific functional signals you wish to visualize.