Research platform · 2026

How do we measure reading when readers are online?

ReadLab is a configurable, browser-based platform for oral reading fluency, used both as an assessment tool that captures and analyses recordings end-to-end, and as a training tool: researchers configure trials, rounds, and scheduled intervals between them, then watch each participant's reading change over time. The pipeline is language-aware; the current rule tuning is deepest for Persian, with English running through the same analysis path.

65 CSV parameters · 10 pipeline stages · 6 token labels
Trial 03 · English · 220 WPM recording
Voice activity · live onset 412ms · offset 6.8s
/01
Participant experience

A reading session, end to end.

Almost everything about a session is configurable: presentation mode, animation unit, font and size, text colour, layout and box width, reveal speed, the number of rounds and the scheduled gap between them, which passages each round draws from, and the rules that compute speed and accuracy. The participant simply opens a link and reads. The flow below is what they actually see.

/ 01 · LOGIN

Code or URL

Participants enter a code, or arrive on a Prolific URL that auto-binds them to a study. The session resumes if interrupted.

P-4F2A
/ 02 · PRE-FLIGHT

Microphone check

A short calibration checks signal level and clipping, and records the noise floor for context. Participants who fail three attempts are blocked from continuing.

/ 03 · READING

Reveal or static

Two presentation modes. Reveal animates word, phrase, line, or sentence at a configured pace. Static shows the page at once; no-scroll sessions split long passages into screen-sized pages.

صَبر  و  دقت  پایه
/ 04 · CAPTURE

Voice recording

Audio is recorded during the reading page. Energy is tracked client-side, while the audio and trial package are uploaded for server analysis when the page ends.

/ 05 · ADVANCE

Space to continue

The participant presses space to move on. Multi-day delays between rounds are enforced by the scheduler. The next round simply will not open until the configured interval has passed.
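A minimal sketch of the interval gate described above. The helper name `round_is_open` and the use of UTC timestamps are assumptions for illustration, not ReadLab's actual scheduler.

```python
from datetime import datetime, timedelta, timezone


def round_is_open(previous_completed_at, min_gap, now=None):
    """Return True once the configured gap since the previous round has elapsed."""
    now = now or datetime.now(timezone.utc)
    return now - previous_completed_at >= min_gap


# Hypothetical study: round 2 opens two days after round 1 was completed.
done = datetime(2026, 1, 5, 9, 0, tzinfo=timezone.utc)
gap = timedelta(days=2)
print(round_is_open(done, gap, now=done + timedelta(days=1)))  # False
print(round_is_open(done, gap, now=done + timedelta(days=2)))  # True
```

Checking on the server side against a stored completion time, rather than in the browser, is one way to make the gate hard to bypass.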

/ MODE A

Reveal animation

Words appear one at a time, or by phrase, line, or sentence, at a speed the researcher configures. Each unit is hidden until its turn, then becomes visible. The renderer keeps its own clock; the timing never depends on speech-to-text.

Patience and precision in every word are the root of literacy.
word-by-word reveal 220 WPM · auto-pace
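As a rough sketch of how a reveal clock could be derived from a configured pace: each unit gets a slot proportional to its word count, independent of any speech-to-text output. The function name and the proportional-slot rule are assumptions; the real renderer may pace units differently.

```python
def reveal_schedule(units, wpm):
    """Return (unit, reveal_at_ms) pairs for a fixed words-per-minute pace."""
    ms_per_word = 60_000 / wpm
    schedule, t = [], 0.0
    for unit in units:
        schedule.append((unit, round(t)))
        # A phrase, line, or sentence holds the screen for all of its words.
        t += ms_per_word * len(unit.split())
    return schedule


words = "Patience and precision in every word".split()
for unit, at_ms in reveal_schedule(words, wpm=220):
    print(f"{at_ms:>5} ms  {unit}")
```

Passing phrases instead of single words (for example `["Patience and precision", "in every word"]`) gives phrase-level reveal with the same function.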
/ MODE B

Static, viewport-fit

The passage is shown all at once when it fits. Longer no-scroll passages are split into pages sized for the participant's screen, avoiding scrolling and hidden text while still allowing natural reading.

page 1 / 2
Patience and precision in every word are the root of literacy. Accurate reading takes practice, and every language sets its own bar.
page 2 / 2
Research without precise measurement is impossible. Every datum we collect must be inspectable, adjustable, and traceable.
no-scroll, viewport-fit auto-paginated
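The pagination idea can be approximated as below. This is only a character-count stand-in: a browser implementation would measure rendered text against the real viewport, and `chars_per_line` and `lines_per_page` are hypothetical parameters.

```python
import textwrap


def paginate(passage, chars_per_line, lines_per_page):
    """Wrap the passage into lines, then group lines into screen-sized pages."""
    lines = textwrap.wrap(passage, width=chars_per_line)
    return [
        " ".join(lines[i:i + lines_per_page])
        for i in range(0, len(lines), lines_per_page)
    ]


passage = "one two three four five six"
pages = paginate(passage, chars_per_line=9, lines_per_page=2)
for number, page in enumerate(pages, start=1):
    print(f"page {number} / {len(pages)}: {page}")
```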
/02
The engine

A ten-stage pipeline from raw audio to a reviewable trial.

When the researcher processes a recording, every stage runs in order and writes durable outputs for review. Each step has a single, narrow job, and the researcher can inspect the trial record and its detailed analysis payload.

01
Upload

Audio captured in the browser is sent to storage and registered against the trial.

02
Transcribe

Speech-to-text produces a word sequence with millisecond timestamps for each token.

03
Detect speech

Voice activity detection finds first speech, last speech, voiced regions, and long pauses.

04
Normalize

Each word is folded into a canonical form so spelling variants and diacritics do not block matching.

05
Align

Reference and transcript are matched word-by-word using a weighted edit-distance algorithm.

06
Label

Each aligned token receives one of six labels through a strict five-step cascade.

07
Adjudicate

When enabled, selected low-confidence tokens are sent to an AI adjudicator for a second opinion.

08
Duration

Effective reading duration is computed under the configured profile (exposure-based or speech-based).

09
Flag

Eight quality checks raise issues; flagged trials carry a review-required signal in the dashboard.

10
Review

The researcher inspects the trial, overrides any decision, and locks the result.
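The "run in order, write durable outputs" pattern can be sketched as follows. The stage functions here are placeholders and the one-JSON-file-per-stage layout is an assumption, not ReadLab's storage format.

```python
import json
import tempfile
from pathlib import Path


def run_pipeline(trial, stages, out_dir):
    """Run each stage in order and persist its output for later review."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for index, (name, stage) in enumerate(stages, start=1):
        trial = stage(trial)
        path = out_dir / f"{index:02d}_{name}.json"
        path.write_text(json.dumps(trial, ensure_ascii=False, indent=2), encoding="utf-8")
    return trial


# Placeholder stages standing in for transcription and normalization.
stages = [
    ("transcribe", lambda t: {**t, "words": ["The", "Cat"]}),
    ("normalize", lambda t: {**t, "normalized": [w.lower() for w in t["words"]]}),
]

with tempfile.TemporaryDirectory() as tmp:
    result = run_pipeline({"trial_id": "demo"}, stages, tmp)
    written = sorted(p.name for p in Path(tmp).iterdir())
print(written)
```

Because every stage leaves a file behind, a reviewer can open the output of any single step without re-running the ones before it.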

SIGNAL
A fast yes/no, e.g. WCPM, accuracy, flag count.
DECISION
A drill-down for context, e.g. waveform, gap inspector, label distribution.
ACTION
A full review with overrides, e.g. relabel a token, exclude a gap, retry the trial.
/03
Parallel tracks

Two parallel tracks. Same audio, different questions.

Speech-to-text answers what was said. Voice activity detection answers when it was said. The two tracks are computed independently, then their outputs are checked against each other before the metrics are derived.
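One way to picture the cross-check is as two boundary deltas between the tracks. The sign conventions below are assumptions: the start delta is positive when the first aligned word begins after voice activity starts, and the end delta is positive when voice activity continues past the last aligned word. The word timings are hypothetical.

```python
def cross_check(vad_onset_ms, vad_offset_ms, first_word_start_ms, last_word_end_ms):
    """Compare voice-activity boundaries with the aligned transcript boundaries."""
    return {
        "start_delta_ms": first_word_start_ms - vad_onset_ms,
        "end_delta_ms": vad_offset_ms - last_word_end_ms,
    }


deltas = cross_check(
    vad_onset_ms=412,
    vad_offset_ms=6814,
    first_word_start_ms=450,   # hypothetical aligned first word
    last_word_end_ms=6672,     # hypothetical aligned last word
)
print(deltas)
```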

Speech-to-text · word-level transcription with timestamps

transcription

The transcript is not produced blindly. Before the audio is sent to the engine, ReadLab analyses the reference passage, filters common Persian stopwords, and favours custom, rare, long, or hard-to-transcribe terms as decoder hints. Researchers can also provide their own keyterms in the experiment config. The point is that uncommon vocabulary, names, and domain-specific terms have a better chance of surviving transcription instead of being normalised away by the model's prior.
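A simplified sketch of keyterm selection under stated assumptions: researcher-supplied terms go first, stopwords are dropped, and word length stands in for "rare or hard to transcribe". A real implementation would likely use a frequency list as well. The function name and parameters are hypothetical.

```python
import re


def pick_keyterms(passage, stopwords, custom=(), min_length=7, limit=10):
    """Choose decoder hints: custom terms first, then long non-stopwords."""
    seen, terms = set(), []
    for term in custom:
        if term.lower() not in seen:
            seen.add(term.lower())
            terms.append(term)
    candidates = [
        word for word in re.findall(r"\w+", passage)
        if word.lower() not in stopwords and len(word) >= min_length
    ]
    # Longest first; sorted() is stable, so ties keep passage order.
    for word in sorted(candidates, key=len, reverse=True):
        if word.lower() not in seen:
            seen.add(word.lower())
            terms.append(word)
    return terms[:limit]


hints = pick_keyterms(
    "The petrichor and the susurrus of anastomosis",
    stopwords={"the", "and", "of"},
    custom=["Aldebaran"],
)
print(hints)
```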

patience · 412 → 720 ms
and · 730 → 810 ms
precision · 820 → 1180 ms
in · 1190 → 1290 ms
every · 1300 → 1410 ms
filtered stopwords: و از که
favoured keyterms: anastomosis petrichor susurrus Aldebaran forsooth

Voice activity detection · speech / silence boundaries

boundaries

Voice activity detection is a separate analysis that ignores the words entirely. It asks where speech is likely happening and where it is not, using voice structure rather than raw volume alone. Non-speech noise is usually kept out of the timing window, while speech-like background audio can still need review. Without this track, pre-reading silence, post-reading silence, and long pauses inside a passage would all inflate reading time.

onset 412 ms · offset 6 814 ms · speech 5.84 s · longest gap 620 ms
legend: speech · pause · first speech · last speech
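Given voiced regions from any detector, the timing summary is simple arithmetic. The sketch below assumes non-overlapping `(start_ms, end_ms)` regions and a hypothetical `long_pause_ms` setting; the regions are made up for illustration and do not reproduce the figures above.

```python
def summarise_voiced_regions(regions, long_pause_ms=500):
    """Derive onset, offset, voiced time, and pauses from (start_ms, end_ms) regions."""
    regions = sorted(regions)
    gaps = [
        (prev_end, start)
        for (_, prev_end), (start, _) in zip(regions, regions[1:])
    ]
    return {
        "onset_ms": regions[0][0],
        "offset_ms": regions[-1][1],
        "speech_ms": sum(end - start for start, end in regions),
        "longest_gap_ms": max((end - start for start, end in gaps), default=0),
        "long_pauses": [(s, e) for s, e in gaps if e - s >= long_pause_ms],
    }


summary = summarise_voiced_regions([(412, 2900), (3520, 5100), (5300, 6814)])
print(summary)
```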
/04
Normalization & alignment

Same sound, different spelling. The engine knows.

ReadLab folds each word into matching forms and runs alignment on a weighted cost ladder. English uses the same exact-match and character-similarity path shown below; Persian currently has the deeper phonetic tuning, where confusable letters such as ص ث س or ز ذ ض ظ can be treated as the same sound instead of a mispronunciation.

Normalization output

original | normalized | internal key
Running! | running | running
مي‌روم | می‌روم | مYرVم
صبر | صبر | Sبر
ثبر | ثبر | Sبر

English: the key equals the normalized word.
Persian: Arabic/Persian variants normalize first; confusable letters then share one internal key.
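The two-step folding can be sketched as below. The mapping tables are assumptions inferred from the examples in this section (the S, Z, Y, V class symbols and the ZWNJ handling in particular); ReadLab's real tables are likely larger. Code points are written as escapes to avoid right-to-left rendering issues.

```python
import unicodedata

ZWNJ = "\u200c"
# Arabic code points folded to their Persian counterparts (yeh, kaf).
ARABIC_TO_PERSIAN = {"\u064a": "\u06cc", "\u0643": "\u06a9"}
# Confusable letters share one class symbol (classes assumed for illustration).
SOUND_CLASS = {
    "\u0635": "S", "\u062b": "S", "\u0633": "S",                 # sad, theh, seen
    "\u0632": "Z", "\u0630": "Z", "\u0636": "Z", "\u0638": "Z",  # zain, thal, dad, zah
    "\u06cc": "Y", "\u0648": "V",                                # yeh, waw
}


def normalize(word):
    """Lower-case, unify Arabic/Persian variants, drop diacritics and punctuation."""
    out = []
    for ch in unicodedata.normalize("NFC", word).lower():
        ch = ARABIC_TO_PERSIAN.get(ch, ch)
        category = unicodedata.category(ch)
        if category == "Mn" or category.startswith("P"):
            continue
        out.append(ch)
    return "".join(out)


def internal_key(word):
    """Map confusable letters to a shared class and drop the zero-width non-joiner."""
    return "".join(SOUND_CLASS.get(ch, ch) for ch in normalize(word) if ch != ZWNJ)


print(normalize("Running!"), internal_key("Running!"))
print(internal_key("\u0635\u0628\u0631") == internal_key("\u062b\u0628\u0631"))
```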

Weighted alignment

Each matched pair carries the cost the engine paid to make that match.
reference: the cat running mat
transcript: the cat runing truck
the ↔ the · cost 0.00 · exact match
running ↔ runing · cost 0.30 · high char similarity
mat ↔ truck · cost 2.10 · prefer del + ins
Substitution cost ladder

Six tiers of similarity, one DP table

Every possible word-pair has a price. Identity is free. In Persian, a confusable-letter difference can be almost free. As similarity drops, the price rises until it exceeds the cost of skipping the reference word and inserting the transcript word as a separate event, which stops the system from inventing mispronunciations between unrelated words.

0.00 · Exact match · normalized strings identical · cat ↔ cat
0.15 · Phonetic match · internal keys identical, strongest in Persian today · صبر ↔ ثبر
0.30 · High char similarity · charLev ≥ 0.7, likely typo or affix variant · running ↔ runing
0.50 · Medium similarity · charLev ≥ 0.5, shared letters, similar order · kitten ↔ sitting
0.80 · Low similarity · charLev ≥ 0.3, faintly resembles, probably not a real attempt · night ↔ nite
2.00 · Delete + insert · implicit breakeven, cost of treating the pair as two separate events
·· algorithm switches above this line ··
2.10 · Unrelated · charLev < 0.3, intentionally above 2.00 so DP rejects substitution · cat ↔ truck
Above the breakeven of 2.00, the algorithm always prefers to delete the reference word and insert the transcript word as two separate events rather than force a substitution. That single choice is what lets the labels downstream stay honest: a reference word that was skipped reads as omission, the unrelated transcript word reads as insertion, and nothing pretends to be a mispronunciation that wasn't.
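The ladder and the breakeven can be exercised with a small dynamic-programming sketch. Two assumptions are made here: `charLev` is taken to be 1 minus the Levenshtein distance divided by the longer word's length, and deletion and insertion are priced at 1.00 each so that together they give the 2.00 breakeven.

```python
def char_similarity(a, b):
    """1 - Levenshtein distance / longer length (assumed definition of charLev)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1 - prev[-1] / max(len(a), len(b), 1)


def substitution_cost(ref, hyp, key=lambda w: w):
    """Price a word pair on the six-tier ladder; `key` is the phonetic key function."""
    if ref == hyp:
        return 0.00
    if key(ref) == key(hyp):
        return 0.15
    sim = char_similarity(ref, hyp)
    if sim >= 0.7:
        return 0.30
    if sim >= 0.5:
        return 0.50
    if sim >= 0.3:
        return 0.80
    return 2.10


GAP = 1.0  # assumed cost of one deletion or one insertion


def align(ref, hyp, key=lambda w: w):
    """Weighted edit-distance alignment; returns (operations, total cost)."""
    n, m = len(ref), len(hyp)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * GAP
    for j in range(1, m + 1):
        cost[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + substitution_cost(ref[i - 1], hyp[j - 1], key),
                cost[i - 1][j] + GAP,
                cost[i][j - 1] + GAP,
            )
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # walk back from the end, preferring substitution on ties
        if i > 0 and j > 0:
            sub = substitution_cost(ref[i - 1], hyp[j - 1], key)
            if abs(cost[i][j] - (cost[i - 1][j - 1] + sub)) < 1e-9:
                ops.append(("match" if sub == 0 else "sub", ref[i - 1], hyp[j - 1], sub))
                i, j = i - 1, j - 1
                continue
        if i > 0 and abs(cost[i][j] - (cost[i - 1][j] + GAP)) < 1e-9:
            ops.append(("del", ref[i - 1], None, GAP))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1], GAP))
            j -= 1
    return ops[::-1], cost[n][m]


ops, total = align("the cat running mat".split(), "the cat runing truck".split())
for op in ops:
    print(op)
```

In this run the pair mat and truck never appears as a substitution: pricing it at 2.10 makes the deletion plus insertion path (2.00) cheaper, which is the behaviour the ladder is designed to produce.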
/05
Rule engine

Six labels. A strict cascade. Conservative defaults.

Every aligned token receives exactly one label. The labelling logic is explicit, ordered, and easy to audit: assign a base label from the alignment, detect repetitions where the participant said the same word or phrase again, detect self-corrections where a related wrong attempt is followed by the right word, tighten mispronunciation behind a confidence gate, then annotate any unusually long gap.

CORRECT

Read accurately

Either the normalised forms are identical, or the phonetic keys are identical. A confusable-letter difference still counts as correct.

The cat sat on the mat.
ORF status: counted as correct
OMISSION

Skipped a word

A reference word with no aligned counterpart in the transcript inside the search window. The participant did not read it.

The cat sat on the mat.
ORF status: excluded from correct count
INSERTION

Added a word

A spoken word that does not correspond to any reference word, and that does not qualify as a repetition or self-correction.

The cat quickly sat on the mat.
ORF status: not counted
REPETITION

Said it twice

The same spoken word adjacent or one gap apart, where the reference contains it only once. Phrase-level repetitions are also detected.

The cat sat sat on the mat.
ORF status: first instance counted
SELF-CORRECTION

Tried, then corrected

A failed attempt that is phonetically or visually related to the target, followed by the correct word within two tokens. The recovery is what is counted.

The cap... cat sat on the mat.
ORF status: final attempt counted
MISPRONUNCIATION

Wrong sounds

A substitution that is similar enough to be a real attempt, but not close enough to count as correct.

The cat sat on the mast.
ORF status: excluded from correct count
The five-step cascade

Each step inspects, refines, or defers

01
Base labels from alignment

Match → correct. Phonetic match → correct. Deletion → omission. Insertion → insertion. Substitution → tentative mispronunciation, but only if the alignment confidence exceeds zero.

02
Detect repetitions

An insertion that repeats the previous spoken word is promoted to repetition. Both single-word and multi-word phrase repetitions are detected. The reference is consulted to avoid mistaking a genuine duplicate for a repetition.

03
Detect self-corrections

A wrong attempt followed within two tokens by the correct word becomes self-correction, but only if the attempt and the target share enough character overlap or a phonetic key. Short, unrelated pre-reading speech next to a correct word does not count.

04
Tighten mispronunciation

The aligner usually splits unrelated words before this step. If a very weak substitution still reaches the rule engine, it is not kept as a mispronunciation. Phonetic-key matches and character similarity at or above 0.8 both revert to correct as likely transcription artefacts. Mispronunciation survives only inside the meaningful middle band.

not mispron. (similarity < 0.30) · mispron. (0.30 to 0.80) · correct, artefact (≥ 0.80)
05
Annotate pauses

Gaps longer than the configured threshold are recorded as a property on the following token. They never override a label. They are context the researcher reads alongside the label, not a scoring event.
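The five steps can be sketched as one ordered pass over alignment operations. Everything about the data shape is assumed: each operation is a dict with `kind`, `ref`, `hyp`, and optionally `similarity`, `phonetic`, and `gap_ms`. `difflib.SequenceMatcher` stands in for the character-overlap test, the 0.5 overlap and 1 000 ms pause thresholds are illustrative, and demoting a very weak substitution to an omission is a guess at what "not kept as a mispronunciation" means.

```python
from difflib import SequenceMatcher

BASE = {"match": "correct", "del": "omission", "ins": "insertion", "sub": "mispronunciation"}


def label_tokens(ops, pause_threshold_ms=1000, overlap_min=0.5):
    # Step 1: base labels straight from the alignment.
    tokens = [dict(op, label=BASE[op["kind"]]) for op in ops]

    # Step 2: an insertion that repeats the previous spoken word is a repetition.
    last_spoken = None
    for tok in tokens:
        if tok["label"] == "insertion" and tok["hyp"] == last_spoken:
            tok["label"] = "repetition"
        if tok["hyp"] is not None:
            last_spoken = tok["hyp"]

    # Step 3: a related wrong attempt followed within two tokens by the right word.
    for i, tok in enumerate(tokens):
        if tok["label"] != "insertion":
            continue
        for nxt in tokens[i + 1:i + 3]:
            related = nxt["label"] == "correct" and (
                SequenceMatcher(None, tok["hyp"], nxt["ref"]).ratio() >= overlap_min
            )
            if related:
                tok["label"] = "self-correction"
                break

    # Step 4: keep mispronunciation only inside the middle similarity band.
    for tok in tokens:
        if tok["label"] == "mispronunciation":
            if tok.get("phonetic") or tok["similarity"] >= 0.8:
                tok["label"] = "correct"      # likely transcription artefact
            elif tok["similarity"] < 0.3:
                tok["label"] = "omission"     # assumed fallback for unrelated pairs

    # Step 5: long gaps are annotated as context and never change the label.
    for tok in tokens:
        if tok.get("gap_ms", 0) > pause_threshold_ms:
            tok["long_pause_ms"] = tok["gap_ms"]
    return tokens


ops = [
    {"kind": "match", "ref": "the", "hyp": "the"},
    {"kind": "ins", "ref": None, "hyp": "cap"},
    {"kind": "match", "ref": "cat", "hyp": "cat"},
    {"kind": "match", "ref": "sat", "hyp": "sat"},
    {"kind": "ins", "ref": None, "hyp": "sat"},
    {"kind": "match", "ref": "on", "hyp": "on", "gap_ms": 1500},
    {"kind": "match", "ref": "the", "hyp": "the"},
    {"kind": "sub", "ref": "mat", "hyp": "mast", "similarity": 0.75},
]
labelled = label_tokens(ops)
print([tok["label"] for tok in labelled])
```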

A wrong label is worse than no label. When the evidence is weak, the system avoids over-calling errors.
CORE PHILOSOPHY · Rule engine
/06
Metrics & flags

Two layers of measurement.

The first layer reports the headline metrics for every trial: words correct per minute, accuracy, coverage, and the timing windows behind them. The second layer stays silent until something looks unusual. Every flag has a configurable threshold, and any flag marks the trial for reviewer attention.

/ TIER ONE · REPORTED ON EVERY TRIAL

Reading-speed formula

WPM (words per minute) = reference word count ÷ (effective duration ÷ 60)
WCPM (words correct per minute, PRIMARY) = correct word count ÷ (effective duration ÷ 60)

The numerator comes from the labelling cascade. The denominator comes from the duration profile. The dashboard keeps the saved automatic metrics, then recomputes final metrics when a reviewer changes labels or duration.
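As a worked example of the formula, assuming the effective duration is expressed in seconds. The word counts are illustrative, not taken from a real trial.

```python
def words_per_minute(word_count, effective_duration_s):
    """Words divided by the effective duration expressed in minutes."""
    if effective_duration_s <= 0:
        raise ValueError("effective duration must be positive")
    return word_count / (effective_duration_s / 60)


# Hypothetical trial: 22 reference words, 20 labelled correct, 6.4 s effective duration.
wpm = words_per_minute(22, 6.4)
wcpm = words_per_minute(20, 6.4)
print(round(wpm, 1), round(wcpm, 1))
```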

Effective duration · profile A vs B

A · exposure: page display → space
B · speech: first speech → last speech
page 0 ms · first speech 412 ms · last speech 6 814 ms · space 7 240 ms
Profile A · 7.24 s
Total exposure time. Simple and robust, but inflated by silence before reading begins and by hesitation before the participant presses space.
Profile B (default) · 6.40 s
Voice-activity boundaries only. Closer to standard ORF timing practice. Removes pre- and post-reading silence. The researcher can override the start, end, or excluded gaps per trial.
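The two profiles reduce to a choice of start and end events, minus any gaps the researcher excludes. The event keys and the `excluded_gaps` parameter below are hypothetical names, and excluded gaps are assumed not to overlap each other.

```python
def effective_duration_ms(events, profile="B", excluded_gaps=()):
    """Duration under profile A (exposure) or B (speech), minus excluded gaps."""
    if profile == "A":
        start, end = events["page_ms"], events["space_ms"]
    elif profile == "B":
        start, end = events["first_speech_ms"], events["last_speech_ms"]
    else:
        raise ValueError(f"unknown profile: {profile!r}")
    excluded = 0
    for gap_start, gap_end in excluded_gaps:
        # Only the part of a gap that lies inside the window is subtracted.
        overlap = min(gap_end, end) - max(gap_start, start)
        if overlap > 0:
            excluded += overlap
    return end - start - excluded


events = {"page_ms": 0, "first_speech_ms": 412, "last_speech_ms": 6814, "space_ms": 7240}
print(effective_duration_ms(events, "A") / 1000)  # 7.24
print(effective_duration_ms(events, "B") / 1000)  # 6.402
```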
/ TIER TWO · RAISES A FLAG WHEN A THRESHOLD IS CROSSED
Every threshold is set in the experiment CSV. The defaults shown below are the values shipped with the platform; researchers raise or lower them per study. Any single crossed threshold sets reviewRequired = true so the dashboard can surface the trial for closer review.
High onset latency · speech_onset_latency_ms > 5 000 ms
What it asks. How long after the page appeared did the participant actually start speaking?
Why it matters. A long latency suggests the participant did not yet understand the task, was confused by the text, or was distracted. Severity escalates when the value is more than twice the configured threshold.
What you do. Listen to the opening of the recording. Either accept the latency, or override the duration start to ignore the dead air.
High post-speech lag · post_speech_lag_ms > 5 000 ms
What it asks. How long between the participant's last spoken word and pressing space to advance?
Why it matters. Long lag inflates Profile A duration. It often means the participant finished and then sat for several seconds, which is interesting behaviourally but should not depress their WCPM.
What you do. Profile B already excludes this lag. If using Profile A, consider an override.
Long internal gap · longest_gap_ms > 10 000 ms
What it asks. What is the longest silence between two adjacent spoken words?
Why it matters. A multi-second pause inside a passage is rarely natural reading behaviour. The participant may have lost their place, encountered an unfamiliar word, or been interrupted.
What you do. Inspect the gap in the waveform. Researchers can mark a specific gap as excluded so it does not affect duration or speed.
Low reference coverage · reference_coverage_pct < 50 %
What it asks. What fraction of the reference passage was actually read?
Why it matters. A trial where less than half the passage was read should not produce a publishable WCPM. The participant may have stopped early, skipped a section, or misunderstood the instructions.
What you do. Decide whether to exclude the trial, retry it, or treat it as a partial reading.
High off-reference ratio · off_reference_pct > 40 %
What it asks. What fraction of the participant's spoken words could not be aligned to any reference word?
Why it matters. Lots of off-reference speech usually means the participant read a different passage, paraphrased, or carried on a side conversation during the recording. It also catches transcription drift.
What you do. Confirm the participant was on the right passage; if so, decide whether to label or exclude.
Large start delta · |start_delta_ms| > 3 000 ms
What it asks. Is there a large gap between when voice-activity says speech started and when the first reference word was aligned?
Why it matters. A big positive delta means the participant said something off-reference before reading. A big negative delta usually points to a boundary error in either detector. Either way, the timing window cannot be trusted without a look.
What you do. Listen to the opening seconds. If the participant is reading from the start, override; if they are warming up off-script, accept the delta.
Large end delta · |end_delta_ms| > 3 000 ms
What it asks. Is there a large gap between when voice-activity says speech ended and when the last reference word was aligned?
Why it matters. Off-reference speech after the passage, a long trailing word that voice-activity didn't catch, or a mismatched final boundary all show up here.
What you do. Inspect the closing seconds and decide whether the duration's end source is right.
Low speech percentage · speech_pct < 30 %
What it asks. What fraction of the trial duration was actually speech, as opposed to silence?
Why it matters. Very low speech-percentage trials are dominated by silence and produce noisy reading-speed numbers. They are typically partial reads, very slow reads, or recordings with audio problems.
What you do. Confirm the recording is intact and decide whether the trial should count.
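The eight checks above can be expressed as a small threshold table. The metric names and default values are the ones listed in this section; the table layout, the `evaluate_flags` function, and the `severity` wording are assumptions for illustration.

```python
DEFAULT_THRESHOLDS = {
    "speech_onset_latency_ms": (">", 5000),
    "post_speech_lag_ms": (">", 5000),
    "longest_gap_ms": (">", 10000),
    "reference_coverage_pct": ("<", 50),
    "off_reference_pct": (">", 40),
    "start_delta_ms": ("abs>", 3000),
    "end_delta_ms": ("abs>", 3000),
    "speech_pct": ("<", 30),
}


def evaluate_flags(metrics, thresholds=DEFAULT_THRESHOLDS):
    """Return raised flags and the review-required signal for one trial."""
    flags = []
    for name, (op, limit) in thresholds.items():
        if name not in metrics:
            continue
        value = metrics[name]
        crossed = {
            ">": value > limit,
            "<": value < limit,
            "abs>": abs(value) > limit,
        }[op]
        if not crossed:
            continue
        flag = {"flag": name, "value": value, "threshold": limit}
        # Onset latency escalates past twice the threshold (severity labels assumed).
        if name == "speech_onset_latency_ms":
            flag["severity"] = "high" if value > 2 * limit else "normal"
        flags.append(flag)
    return {"flags": flags, "reviewRequired": bool(flags)}


report = evaluate_flags({
    "speech_onset_latency_ms": 12000,
    "start_delta_ms": -3500,
    "reference_coverage_pct": 100,
    "speech_pct": 62,
})
print(report)
```

Per-study values from the experiment CSV would replace `DEFAULT_THRESHOLDS` by passing a different `thresholds` table.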
/07
Researcher dashboard

Three layers of depth, always traceable.

A KPI strip at the top reports speed, accuracy, and coverage. The waveform shows the audio with the speech overlay. The token grid lists every word with its label; right-clicking any token opens an override menu, and metrics recompute live as the researcher edits.
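Live recomputation after an override can be pictured as re-counting labels and re-applying the speed formula. This sketch assumes that only tokens labelled correct feed the WCPM numerator and that overrides are stored as a token-index to label map; both are illustrative choices.

```python
def recompute(tokens, effective_duration_s, overrides=None):
    """Apply reviewer overrides (index -> label) and recompute WCPM."""
    overrides = overrides or {}
    labels = [overrides.get(i, tok["label"]) for i, tok in enumerate(tokens)]
    correct = labels.count("correct")
    return {
        "labels": labels,
        "correct": correct,
        "wcpm": correct / (effective_duration_s / 60),
    }


tokens = [
    {"word": "the", "label": "correct"},
    {"word": "cat", "label": "correct"},
    {"word": "mat", "label": "mispronunciation"},
]
before = recompute(tokens, effective_duration_s=30)
after = recompute(tokens, effective_duration_s=30, overrides={2: "correct"})
print(before["wcpm"], after["wcpm"])
```

Keeping the automatic labels untouched and layering overrides on top means the saved automatic metrics and the reviewed metrics can both be reported.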

P-4F2A · trial 03 · review
WCPM · 187 w/min · ▲ +6 vs round 1
Accuracy · 94.2% · ▲ +1.4
Coverage · 100% · full passage
Flags · 2 · needs review
Audio · with voice-activity overlay
Tokens · right-click to override
The cat sat on the mat while while the dog was slee… sleeping in the sun all afternoon
Active flags
High onset latency · speech_onset_latency_ms = 1 840 (threshold 1 500)
Long internal gap · longest_gap_ms = 2 310 between tokens 8 and 9
Cross-check passed · start_delta_ms = 38 · end_delta_ms = 142

Review state
committed → in review → review passed → locked
[Image: researcher annotating an ORF transcript by hand]
The same act, without instruments. ReadLab brings the rigor of in-room ORF annotation to remote, browser-based studies.

Every parameter configurable. Every label overridable. Every metric traceable.

/ CONFIG
CSV-driven. No code changes per experiment.
/ TRANSPARENCY
Every token, every cost, every threshold inspectable.
/ DEFENSIBILITY
Outputs that can be defended in a methods section.