Documentation

The WeCureUs query interface returns population statistics over data contributed by people living with multiple sclerosis. This page explains what the dataset contains and how to read the results. The full legal terms are in the Data Use Agreement.

Getting started

First, register and accept the Data Use Agreement. Registration starts an application; no Research Access Key is issued yet. An operator reviews it, and if you are approved we email you an invitation to sign in. You sign in with a one-time email link, the same passwordless method participants use, and on your first sign-in your Research Access Key is shown once in your browser, with a copy button. Save it then, because it is not shown again and is never sent by email. If you ever lose it, rotate it from your profile page. Then open the query builder, enter your Research Access Key, and build a query. Your key is held only in your browser session and is sent with each query you submit. The query builder requires a valid Research Access Key before any of the data schema or query controls are shown.

There is nothing to pay, at any stage. Aggregate query access is provided free of charge to all approved researchers and institutions, and WeCureUs does not charge, and will not charge, for access to aggregate community data. Registration is free, approval is free, and querying is free. There is no paid tier, so no researcher can obtain additional data, finer granularity, or faster access by paying for it. This is a commitment about the Corporation's mission rather than a commercial term, and it is set out in Section 2.5 of the Data Use Agreement, which also records that it is not subject to routine revision.

What the dataset contains

The dataset is fully intersectional. Every question a participant has answered, every clinical record they have contributed, and every characteristic in their profile can be used to filter any query. There is no fixed list of permitted cross-references. If two things are present in the dataset, you can ask how they relate.

The scale is substantial and growing. Participants answer hundreds of data points across questionnaire modules spanning the dimensions of living with MS: symptoms, daily function, treatment, and the diagnostic experience. They also contribute records and treatments directly, including the medications, supplements, procedures, diets, and assistive devices they have tried, along with radiology reports, lab results, and ancestry data. New modules and new participants are added over time, so the dataset is broader today than it was at launch and continues to grow.

Three types of data are queryable.

Questionnaire responses. Every question across all available modules, plus the One Shot and Twofer questions (see below). Question types are single-select, yes/no, year, numeric integer, and multi-select. Free-text responses are never aggregated or exposed.
Contributed records. Findings drawn from records and treatments participants contribute themselves, across all eight record types: medications, supplements, procedures, diet, devices, radiology findings (lesion location, contrast enhancement, overall trajectory, lesion count), lab results, and ancestry or genealogical data (including ancient steppe ancestry). Lab findings include which tests were contributed, the qualitative result of the key MS-related serology and CSF tests (Epstein-Barr VCA IgG and EBNA IgG serostatus, CSF oligoclonal bands, and ANA, each positive, negative, or equivocal), and banded numeric values (vitamin D, the CSF IgG index, and the kappa/lambda free light chain ratio). Counts over records are always counts of distinct participants, never counts of records.
Participant characteristics. Profile dimensions established during enrollment: MS subtype, diagnosis year, birth year, biological sex, gender identity, postal code, and disease-modifying therapy (DMT) status.

Any type can filter any other. A cohort filter is a constraint drawn from any of these three types, and any query can carry one or more of them. You can request the distribution of a fatigue question among participants who contributed a radiology report showing a specific lesion location. You can request a lesion-location distribution among participants who answered a cognitive symptoms question a particular way. You can narrow any of this further by participant characteristics. Multiple filters combine, and each one narrows the cohort. The narrower the cohort, the more likely a result falls below the minimum threshold and is withheld or generalized, as described below.

Building complex queries

The query builder lets you layer several kinds of filter on one query. All of them narrow the cohort, and all are combined with AND.

Cohort filters constrain participant characteristics (MS subtype, diagnosis year, and so on). Each filter offers an is / is not choice, so you can include or exclude a value, and a single filter can hold several values at once (matching any of them).
Narrow by module response (Step 4) restricts the cohort to participants whose answer to another module or One Shot matches. You can add more than one module-response filter, from different modules, and all of them must be satisfied. For example, you can select participants who reported heat sensitivity in one module AND a significant cognitive impact in another.
Narrow by contributed records uses a two-phase sub-builder. First choose a record type (medications, procedures, supplements, diet, devices, radiology, lab results, or ancestry); an attribute panel then lists every queryable attribute of that type. Any conditions you set on one record type must all hold on the same contributed record. Setting a medication’s agent to Ocrevus and its effectiveness to “very effective” therefore means Ocrevus itself was rated very effective, not merely taken. Each attribute has its own is / is not toggle, so you can exclude participants who have a matching record (for example, agent is not Ocrevus). Add another record type to filter on it too. This filter is available in all four query modes, including Cross-tabulate, so a contributed-record value (such as EBV EBNA IgG positive) can be used as a cohort filter on any query, whether a single distribution or a cross-tabulation.
Lab-value and percentage bands. For banded numeric record attributes (such as vitamin D level or an ancestry percentage), the sub-builder shows the same k-anonymous bands the distribution uses as selectable chips. Select one or more bands to narrow the cohort to participants whose value falls in any of the chosen bands (for example vitamin D in 0-10 or 10-20 ng/mL, or a Yamnaya ancestry percentage in 30-40 or 40-50). The bands are read from the catalog for each dimension, so you filter on the ranges that actually exist in the data rather than typing an arbitrary number. Exact values are never used as filters or surfaced.
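
The same-record rule for contributed records can be illustrated with a minimal sketch. The record layout, the participant ids, and the helper name below are hypothetical and are not the query engine's actual data model; the second drug name is only there as a contrasting example.

```python
# Hypothetical layout: participant id -> list of contributed medication records.
records = {
    "p1": [{"agent": "Ocrevus", "effectiveness": "very effective"}],
    "p2": [
        {"agent": "Ocrevus", "effectiveness": "not effective"},
        {"agent": "Tysabri", "effectiveness": "very effective"},
    ],
    "p3": [{"agent": "Tysabri", "effectiveness": "very effective"}],
}


def has_matching_record(participant_records, conditions):
    """True when one single record satisfies every condition at once."""
    return any(
        all(rec.get(attr) == value for attr, value in conditions.items())
        for rec in participant_records
    )


# Both conditions must hold on the same record, so p2 does not qualify:
# p2 took Ocrevus and rated something "very effective", but on different records.
conditions = {"agent": "Ocrevus", "effectiveness": "very effective"}
included = {pid for pid, recs in records.items() if has_matching_record(recs, conditions)}

# "is not", as described above: exclude participants who have a matching record.
agent_is_not_ocrevus = {
    pid for pid, recs in records.items()
    if not has_matching_record(recs, {"agent": "Ocrevus"})
}
```

In this sketch only p1 is included by the two-condition filter, and only p3 passes the "agent is not Ocrevus" filter.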

Because every filter narrows the cohort, a heavily filtered query is more likely to fall below the minimum-cohort threshold and be withheld or generalized. The narrowing applies before the result is computed, and the minimum-cohort rule always applies to the final cohort.
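
The way filters combine can be sketched as follows: values inside one filter match any of them, while separate filters are combined with AND. This is an illustration under assumed names; the field names, the tuple shape of a filter, and the treatment of a missing value are hypothetical, not the builder's request format.

```python
# Hypothetical in-memory model of participant characteristics.
participants = [
    {"ms_subtype": "RRMS", "sex": "F"},
    {"ms_subtype": "SPMS", "sex": "F"},
    {"ms_subtype": "PPMS", "sex": "M"},
    {"ms_subtype": "RRMS", "sex": "M"},
]


def matches(participant, filters):
    """True when the participant satisfies every filter (AND across filters)."""
    for field, mode, values in filters:
        # Several values in one filter match any of them (OR within a filter).
        hit = participant.get(field) in values
        if mode == "is" and not hit:
            return False
        if mode == "is_not" and hit:
            return False
    return True


filters = [
    ("ms_subtype", "is", {"RRMS", "SPMS"}),
    ("sex", "is_not", {"M"}),
]

# Narrowing happens first; any minimum-cohort check would apply to this final cohort.
cohort = [p for p in participants if matches(p, filters)]
```

Each added filter can only keep the cohort the same size or shrink it, which is why heavily filtered queries are the ones most likely to fall below the threshold.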

Cross-tabulation

A cross-tabulation returns a joint distribution over two dimensions at once, instead of one distribution over one dimension. For example, MS subtype by CSF oligoclonal-bands result: for every combination of a subtype and a result, the number of distinct participants who have both. It answers questions a pair of separate distributions cannot, because it shows how the two dimensions co-occur within the same people.

This build supports categorical dimensions on both axes: a single-select, yes/no, or multi-select module question, or a contributed-record dimension whose values are categories (a scalar coded field, a label list, or a qualitative result). Numeric and banded dimensions (year, numeric integer, lesion or scan counts, banded lab values, ancestry percentages, and the demographic profile dimensions) are not yet available as cross-tab axes; a two-way numeric cross-tab needs banding chosen so that every joint cell, not just each margin, meets the threshold, and that is a separate step. The same cohort, module-response, and record filters that narrow a single query define the population for a cross-tab.

Suppression is stricter than for a single distribution, because a joint cell can isolate a small group that neither single distribution reveals. Three rules apply.

Per-cell threshold. Every cell must contain at least five distinct participants (the same k=5 as everywhere else). A cell below five is suppressed and never shown.
Complementary suppression. If suppressing a small cell would leave exactly one suppressed cell in its row or its column, a second cell in that line is suppressed too, and the step repeats until no row or column has a single suppressed cell. This prevents a suppressed cell from being recovered by subtracting the visible cells in a row from the single-dimension total for that row. A consequence is that a two-by-two table with one small cell is suppressed entirely; larger tables keep most cells.
No margins. The result returns only the surviving cells (each at least five) plus a count of how many cells and participants were suppressed. Row and column totals are not returned; the only margins available are the ordinary single-dimension distributions, which are independently k-anonymized and protected by the rule above.
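
The per-cell and complementary rules above can be sketched as a small fixed-point loop. This is not the engine's implementation. In particular, which second cell gets suppressed is not specified on this page, so the sketch assumes the smallest visible cell in the line; the function name and the table layout are also hypothetical.

```python
K = 5  # minimum cell size, the same k=5 described above


def suppress(table, k=K):
    """Return the set of (row, col) cells to hide in a 2-D list of counts."""
    n_rows, n_cols = len(table), len(table[0])

    # Per-cell threshold: every cell below k is hidden.
    hidden = {(r, c) for r in range(n_rows) for c in range(n_cols) if table[r][c] < k}

    def lines():
        # Every row, then every column, as lists of cell coordinates.
        for r in range(n_rows):
            yield [(r, c) for c in range(n_cols)]
        for c in range(n_cols):
            yield [(r, c) for r in range(n_rows)]

    # Complementary suppression: repeat until no line has exactly one hidden cell.
    changed = True
    while changed:
        changed = False
        for line in lines():
            suppressed = [cell for cell in line if cell in hidden]
            visible = [cell for cell in line if cell not in hidden]
            if len(suppressed) == 1 and visible:
                # Assumption: hide the smallest remaining cell in this line.
                extra = min(visible, key=lambda cell: table[cell[0]][cell[1]])
                hidden.add(extra)
                changed = True
    return hidden


two_by_two = [[3, 20], [30, 40]]
three_by_three = [[3, 20, 30], [40, 50, 60], [70, 80, 90]]

hidden_small = suppress(two_by_two)
hidden_large = suppress(three_by_three)
```

With these inputs the two-by-two table loses all four cells, while the three-by-three table hides four cells and keeps five, which matches the consequence stated in the complementary suppression rule.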

A cross-tab also carries the same dual denominators as every other result: cohort_size is the number of participants who have a value on both axes, and population_total is the cohort-filtered population, so you can read the joint response rate. When the whole cohort is below five the entire table is withheld.

One Shots and Twofers

Alongside the survey modules, participants answer standalone questions called One Shots and Twofers. These are single targeted questions, not part of a module sequence. A One Shot is one standalone question; a Twofer is a lead question with a follow-up shown only to participants who gave a qualifying answer to the lead, so the follow-up’s cohort is a subset of the lead question’s. In the query builder they appear in their own group, separate from the survey modules.

One Shots and Twofers are queryable exactly like module questions: as a primary dimension, and as a cross-modal filter to narrow any other query. They carry their own identifiers (each begins os_) and are used in the same request shape as a module and question. The same minimum-cohort threshold and generalization apply.

Only the actual responses participants gave are ever available. Whether a participant dismissed, snoozed, or has not yet seen a One Shot is never exposed and is not queryable.

The AI query assistant

The query builder includes an optional AI assistant. You describe a research question in plain language, and it proposes a structured query that pre-fills the query builder for you. It is a starting point you review, not a query it runs.

The assistant never runs a query and never sees any participant-level data. It reasons only over your question and the catalog structure (the modules, questions, options, cohort dimensions, and record dimensions). It does not touch the data itself.

To use it, type a research question, ask the assistant for a proposal, review the pre-filled builder to confirm the module, question, dimension, and any filters it chose, and then click Run yourself. Nothing is queried until you run it.

It understands cross-modal questions that combine a treatment with a condition. A question like “how many people with RRMS who reported heat sensitivity found Ocrevus effective” is decomposed for you into an MS-subtype cohort filter, a heat-sensitivity module-response filter, and a named-drug records filter on the effectiveness distribution. It will also propose a same-record group filter when a question implies one record must satisfy several attributes at once.

It handles what the query engine can answer and is honest about what it cannot. It will decline requests for free-text answers, for raw numeric values or means and medians, and for anything at the individual level, because none of those are available. Every query you run still goes through the query engine with k-anonymity enforced.

How multi-select counts work

For a multi-select question, each option count is the number of distinct participants who selected that option. A participant who selected several options is counted once in each of those option counts, but only once in the cohort size. Because of this, the option counts can add up to more than the cohort size, and that is expected.
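
A short sketch of the counting rule, using made-up responses and hypothetical option names:

```python
from collections import Counter

# Hypothetical responses: participant id -> set of options selected.
responses = {
    "p1": {"fatigue", "numbness"},
    "p2": {"fatigue"},
    "p3": {"fatigue", "numbness", "vision"},
}

# Each participant counts once per option they selected...
option_counts = Counter(
    option for selected in responses.values() for option in selected
)

# ...but only once in the cohort size.
cohort_size = len(responses)

total_selections = sum(option_counts.values())
```

Here the option counts sum to six across a cohort of three, so percentages computed per option against cohort_size will not add up to 100.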

K-anonymity and suppression

Every result is subject to a minimum cohort threshold. No result is returned for a cohort smaller than the threshold (currently five participants). When a whole result is withheld, the response marks it as suppressed and tells you the threshold in effect.

Within a result, any single answer chosen by fewer participants than the threshold is folded into a combined “below threshold” bucket rather than reported on its own. This protects participants who chose rare answers while preserving the rest of the distribution.
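
Both rules can be sketched for a single-select question. The function name and the result keys are hypothetical and do not describe the real response format. Whether the combined bucket's own count is reported is not stated on this page; the sketch keeps it only to show the arithmetic.

```python
def k_anonymize(counts, k=5):
    """Apply the whole-result and per-answer threshold to answer -> count."""
    # Single-select: each participant gives one answer, so the counts sum to the cohort.
    cohort_size = sum(counts.values())

    # Whole-result rule: withhold everything when the cohort is below k.
    if cohort_size < k:
        return {"suppressed": True, "threshold": k}

    # Per-answer rule: rare answers are folded into one combined bucket.
    visible = {answer: n for answer, n in counts.items() if n >= k}
    folded = sum(n for n in counts.values() if n < k)

    result = {"suppressed": False, "threshold": k, "counts": visible}
    if folded:
        result["below_threshold"] = folded  # assumption: bucket count is shown
    return result


withheld = k_anonymize({"yes": 2, "no": 1})
reported = k_anonymize({"daily": 40, "weekly": 12, "monthly": 3, "never": 2})
```

In the second call, "monthly" and "never" disappear as separate answers and are represented together, while "daily" and "weekly" are reported as they are.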

Response rates and denominators

Every result carries two counts: cohort_size, the number of people who answered the question (or contributed the record), and population_total, the size of the cohort-filtered population. The population is everyone your filters select, whether or not they answered this particular question, so cohort_size divided by population_total is the response rate. This matters because a small result can mean either a small cohort or a low response rate, and those are different findings.

When the group that did not answer is itself below the threshold, the exact population is withheld: population_total is null and population_total_suppressed is true. In that case you still know the answered count, just not the precise denominator.
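
A client-side helper for reading the two denominators might look like the sketch below. The three field names come from this page; the helper itself and the dictionary shape of a result are assumptions.

```python
def response_rate(result):
    """Return cohort_size / population_total, or None when it cannot be computed."""
    total = result.get("population_total")
    # The exact denominator is withheld when the non-responding group is too small.
    if result.get("population_total_suppressed") or total is None:
        return None
    if total == 0:
        return None
    return result["cohort_size"] / total


full = {"cohort_size": 40, "population_total": 200, "population_total_suppressed": False}
withheld = {"cohort_size": 40, "population_total": None, "population_total_suppressed": True}

rate_full = response_rate(full)
rate_withheld = response_rate(withheld)
```

Returning None rather than a guessed rate keeps the withheld case visible, so a report does not present an answered count as if it were a share of a known population.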

Generalization and precision

For some dimensions, when a cohort cannot be reported at exact precision, the system returns a coarser granularity rather than withholding the result entirely. Individual birth years may be reported as five-year or ten-year bands; numeric values may be reported as ranges.

When generalization has been applied, the result includes a precision note such as “Results reported at 5-year band granularity.” Do not treat generalized results as if they were exact-precision data.
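
One way to picture the fallback from exact years to bands is the sketch below. The selection rule (use the finest width at which every band meets the threshold) and the band alignment (for example 1980 to 1984) are assumptions made for illustration; this page does not specify how the system chooses.

```python
from collections import Counter


def generalize_years(years, k=5, widths=(1, 5, 10)):
    """Return (band_width, {band_start: count}) at the finest width that meets k."""
    for width in widths:
        # Map each year to the start of its band, e.g. 1981 -> 1980 for width 5.
        bands = Counter((year // width) * width for year in years)
        if all(n >= k for n in bands.values()):
            return width, dict(sorted(bands.items()))
    # Nothing qualifies even at the coarsest width: report nothing.
    return None, {}


birth_years = [1980] * 3 + [1981] * 4 + [1986] * 6
width, distribution = generalize_years(birth_years)
```

In this example exact years fail because only three participants were born in 1980, so the result falls back to five-year bands, and a precision note would accompany it.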

The banded numeric dimensions are the lab-value dimensions lab_vitamin_d, lab_igg_index, and lab_kappa_lambda_ratio (vitamin D, the CSF IgG index, and the kappa/lambda free light chain ratio), and the ancestry percentage dimensions (one per population, such as ancestry_yamnaya_percentage). Raw values and means are never surfaced. These dimensions are banded both when queried as a distribution and when used as a cohort filter, and k=5 is enforced on every band, so a band with fewer than five participants is widened or suppressed, never shown.

The procedure_country dimension uses automatic country-to-region generalization. Exact countries are returned when every country cohort meets the threshold. If any country falls below the threshold, the whole distribution coarsens to region. If a region is still below the threshold, it is suppressed. This is what makes the country and procedure-by-country cross-tabs safe to offer.
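
The three-step country rule can be sketched as follows. The country-to-region mapping, the country codes, and the function name are hypothetical; the real mapping is not given on this page.

```python
from collections import Counter

# Hypothetical mapping, for illustration only.
REGION = {
    "DE": "Europe",
    "FR": "Europe",
    "US": "North America",
    "MX": "North America",
    "IN": "Asia",
}


def procedure_country_distribution(countries, k=5):
    """Return (granularity, counts) for one country value per participant."""
    by_country = Counter(countries)

    # Exact countries only when every country cohort meets the threshold.
    if all(n >= k for n in by_country.values()):
        return "country", dict(by_country)

    # Otherwise the whole distribution coarsens to region...
    by_region = Counter(REGION[c] for c in countries)

    # ...and any region still below the threshold is suppressed.
    kept = {region: n for region, n in by_region.items() if n >= k}
    return "region", kept


countries = ["DE"] * 6 + ["FR"] * 2 + ["US"] * 7 + ["IN"] * 1
granularity, counts = procedure_country_distribution(countries)
```

Coarsening the whole distribution, rather than only the small countries, avoids mixing exact countries with regions in one result, where a small country could otherwise be recovered by subtraction from its region.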

Citing WeCureUs data

All published work that uses WeCureUs aggregate data must include attribution. The Data Use Agreement specifies the required language. At minimum, your citation should state that the data is sourced from the WeCureUs platform, is self-reported and not independently verified, and is subject to k-anonymity protections with the minimum cohort threshold in effect at the time of your query.

Suggested form:

Data sourced from the WeCureUs patient-driven health information platform (wecureus.com). Aggregate results are subject to k-anonymity protections with a minimum cohort threshold of [threshold at time of query]. Results represent self-reported participant data and have not been independently verified against clinical records.

See the Data Use Agreement for the full attribution, notification, and limitations requirements.