ReadLab is a configurable, browser-based platform for oral reading fluency, used both as an assessment tool that captures and analyses recordings end-to-end, and as a training tool: researchers configure trials, rounds, and scheduled intervals between them, then watch each participant's reading change over time. The pipeline is language-aware; the current rule tuning is deepest for Persian, with English running through the same analysis path.
Almost everything about a session is configurable: presentation mode, animation unit, font and size, text colour, layout and box width, reveal speed, the number of rounds and the scheduled gap between them, which passages each round draws from, and the rules that compute speed and accuracy. The participant simply opens a link and reads. The flow below is what they actually see.
Participants enter a code, or arrive on a Prolific URL that auto-binds them to a study. The session resumes if interrupted.
A short calibration checks signal level and clipping, and records the noise floor for context. Participants who fail three attempts are blocked from continuing.
Two presentation modes. Reveal animates word, phrase, line, or sentence at a configured pace. Static shows the page at once; no-scroll sessions split long passages into screen-sized pages.
Audio is recorded while the reading page is on screen. Signal energy is tracked client-side, and the audio and trial package are uploaded for server analysis when the page ends.
The participant presses space to move on. Multi-day delays between rounds are enforced by the scheduler: the next round will not open until the configured interval has passed.
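The interval gate can be pictured as a single comparison against the time the previous round finished. This is a minimal sketch, not ReadLab's scheduler: the helper names and the choice of hours as the interval unit are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def next_round_opens_at(previous_completed_at: datetime, interval_hours: float) -> datetime:
    """Earliest moment the next round may open (hypothetical helper)."""
    return previous_completed_at + timedelta(hours=interval_hours)


def round_is_open(now: datetime, previous_completed_at: datetime, interval_hours: float) -> bool:
    """The round stays closed until the configured interval has fully elapsed."""
    return now >= next_round_opens_at(previous_completed_at, interval_hours)


# Example: a 48-hour gap after a round completed at 09:00 UTC.
done = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
too_early = round_is_open(done + timedelta(hours=47), done, 48)
on_time = round_is_open(done + timedelta(hours=48), done, 48)
```

Using timezone-aware timestamps keeps the comparison stable when participants and servers sit in different time zones.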
Words appear one at a time, or by phrase, line, or sentence, at a speed the researcher configures. Each unit is hidden until its turn, then becomes visible. The renderer keeps its own clock; the timing never depends on speech-to-text.
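Because the renderer keeps its own clock, a reveal schedule can be worked out before the trial starts. The sketch below assumes a constant pace expressed in units per minute; the function name and that unit are illustrative choices, not ReadLab's actual configuration keys.

```python
def reveal_schedule(units: list[str], units_per_minute: float) -> list[tuple[str, int]]:
    """Return (unit, reveal_time_ms) pairs on a fixed clock, independent of any audio."""
    if units_per_minute <= 0:
        raise ValueError("units_per_minute must be positive")
    step_ms = 60_000 / units_per_minute
    return [(unit, round(i * step_ms)) for i, unit in enumerate(units)]


# Word-level reveal at 120 words per minute: one word every 500 ms.
schedule = reveal_schedule("the quick brown fox".split(), units_per_minute=120)
```

The same function covers phrase, line, or sentence reveal if the passage is pre-split into those units.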
The passage is shown all at once when it fits. Longer no-scroll passages are split into pages sized for the participant's screen, avoiding scrolling and hidden text while still allowing natural reading.
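One simple way to picture the page split is greedy packing against a per-page budget. The sketch below assumes the budget is a character count; a real renderer would more likely measure laid-out text against the participant's viewport, so treat the budget as a stand-in.

```python
def paginate(words: list[str], max_chars_per_page: int) -> list[list[str]]:
    """Greedily pack words into pages without exceeding the character budget."""
    pages: list[list[str]] = []
    current: list[str] = []
    used = 0
    for word in words:
        # A word costs its length plus one separating space if the page is not empty.
        extra = len(word) + (1 if current else 0)
        if current and used + extra > max_chars_per_page:
            pages.append(current)
            current, used = [word], len(word)
        else:
            current.append(word)
            used += extra
    if current:
        pages.append(current)
    return pages


pages = paginate("aa bb cc dd".split(), max_chars_per_page=5)
```

A word longer than the budget is placed alone on its own page rather than dropped or split.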
When the researcher processes a recording, every stage runs in order and writes durable outputs for review. Each step has a single, narrow job, and the researcher can inspect the trial record and its detailed analysis payload.
Audio captured in the browser is sent to storage and registered against the trial.
Speech-to-text produces a word sequence with millisecond timestamps for each token.
Voice activity detection finds first speech, last speech, voiced regions, and long pauses.
Each word is folded into a canonical form so spelling variants and diacritics do not block matching.
Reference and transcript are matched word-by-word using a weighted edit-distance algorithm.
Each aligned token receives one of six labels through a strict five-step cascade.
When enabled, selected low-confidence tokens are sent to an AI adjudicator for a second opinion.
Effective reading duration is computed under the configured profile (exposure-based or speech-based).
Eight quality checks raise issues; flagged trials carry a review-required signal in the dashboard.
The researcher inspects the trial, overrides any decision, and locks the result.
Speech-to-text answers what was said. Voice activity detection answers when it was said. The two tracks are computed independently, then their outputs are checked against each other before the metrics are derived.
The transcript is not produced blindly. Before the audio is sent to the engine, ReadLab analyses the reference passage, filters common Persian stopwords, and favours custom, rare, long, or hard-to-transcribe terms as decoder hints. Researchers can also provide their own keyterms in the experiment config. The point is that uncommon vocabulary, names, and domain-specific terms have a better chance of surviving transcription instead of being normalised away by the model's prior.
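The hint selection can be sketched as filter, rank, merge. The version below is an assumption-heavy illustration: it uses word length as a rough proxy for "rare or hard to transcribe", takes the stopword list as a parameter, and uses an English example for readability. ReadLab's actual ranking signals are not described here.

```python
def select_keyterms(
    reference: str,
    stopwords: set[str],
    custom_terms: list[str],
    limit: int = 5,
) -> list[str]:
    """Pick decoder hints: researcher terms first, then the longest non-stopwords."""
    seen: set[str] = set()
    candidates: list[str] = []
    for word in reference.split():
        token = word.strip(".,;:!?").lower()
        if token and token not in stopwords and token not in seen:
            seen.add(token)
            candidates.append(token)
    # Longest first; the sort is stable, so ties keep passage order.
    candidates.sort(key=len, reverse=True)
    # Researcher-provided terms take priority; dict.fromkeys removes duplicates in order.
    merged = list(dict.fromkeys(custom_terms + candidates))
    return merged[:limit]


hints = select_keyterms(
    "The archaeologist found a ziggurat near the river.",
    stopwords={"the", "a", "near"},
    custom_terms=["Susa"],
    limit=3,
)
```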
Voice activity detection is a separate analysis that ignores the words entirely. It asks where speech is likely happening and where it is not, using voice structure rather than raw volume alone. Non-speech noise is usually kept out of the timing window, while speech-like background audio can still need review. Without this track, pre-reading silence, post-reading silence, and long pauses inside a passage would all inflate reading time.
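To see why this track matters for timing, consider what can be derived from the voiced regions alone. The sketch below assumes the detector returns non-overlapping (start_ms, end_ms) regions, and that a speech-based duration subtracts pauses at or above a configured length; both the function name and that subtraction rule are assumptions.

```python
def speech_window(voiced: list[tuple[int, int]], long_pause_ms: int) -> dict:
    """Derive first speech, last speech, long pauses, and an effective duration."""
    if not voiced:
        return {"first_speech_ms": None, "last_speech_ms": None,
                "long_pauses": [], "effective_ms": 0}
    regions = sorted(voiced)
    first = regions[0][0]
    last = max(end for _, end in regions)
    # A long pause is a silent stretch between consecutive voiced regions.
    long_pauses = [
        (prev_end, start)
        for (_, prev_end), (start, _) in zip(regions, regions[1:])
        if start - prev_end >= long_pause_ms
    ]
    paused_ms = sum(end - start for start, end in long_pauses)
    return {
        "first_speech_ms": first,
        "last_speech_ms": last,
        "long_pauses": long_pauses,
        "effective_ms": last - first - paused_ms,
    }


# 1.2 s of silence before speech and one 3 s pause are both kept out of the duration.
window = speech_window([(1200, 4000), (4300, 7000), (10000, 12000)], long_pause_ms=2000)
```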
ReadLab folds each word into matching forms and runs alignment on a weighted cost ladder. English uses the same exact-match and character-similarity path shown below; Persian currently has the deeper phonetic tuning, where confusable letters such as ص ث س or ز ذ ض ظ can be treated as the same sound instead of a mispronunciation.
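A minimal sketch of the two matching forms, assuming a small hand-written mapping: the canonical form unifies Arabic and Persian variants of yeh and kaf and strips diacritics, and the phonetic key collapses the two confusable groups named above. The mapping tables and function names are illustrative and far smaller than a production rule set.

```python
import unicodedata

# Orthographic variants: Arabic yeh -> Persian yeh, Arabic kaf -> Persian kaf.
ORTHOGRAPHIC = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})

# Confusable letters that share a sound in Persian:
# sad and theh -> seen; thal, dad, and zah -> zain.
CONFUSABLE = str.maketrans({
    "\u0635": "\u0633", "\u062b": "\u0633",
    "\u0630": "\u0632", "\u0636": "\u0632", "\u0638": "\u0632",
})


def fold(word: str) -> str:
    """Canonical form: unify letter variants and drop combining marks (diacritics)."""
    text = unicodedata.normalize("NFC", word).translate(ORTHOGRAPHIC)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn").lower()


def phonetic_key(word: str) -> str:
    """Phonetic form: the canonical form with confusable letters collapsed."""
    return fold(word).translate(CONFUSABLE)


# The same word spelled with sad versus seen: different spelling, same key.
with_sad = "\u0635\u0627\u0628\u0648\u0646"
with_seen = "\u0633\u0627\u0628\u0648\u0646"
same_sound = phonetic_key(with_sad) == phonetic_key(with_seen)
```

For English input the same fold step still applies (case folding and mark stripping), while the confusable table simply has nothing to replace.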
Every possible word-pair has a price. Identity is free. In Persian, a confusable-letter difference can be almost free. As similarity drops, the price rises until it exceeds the cost of skipping the reference word and inserting the transcript word as a separate event, which stops the system from inventing mispronunciations between unrelated words.
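The pricing idea can be sketched as a standard dynamic-programming alignment with a similarity-scaled substitution cost. The numbers here are assumptions chosen to show the crossover: skipping or inserting costs 1.0 each, and a substitution costs 3 × (1 − similarity), so pairs with similarity below roughly one third are cheaper to split into an omission plus an insertion. Character similarity comes from Python's difflib; ReadLab's own cost ladder and phonetic discounts are not reproduced.

```python
from difflib import SequenceMatcher

SKIP_COST = 1.0  # assumed price for skipping a reference word or inserting a spoken one


def pair_cost(ref: str, hyp: str) -> float:
    """Identity is free; otherwise the price rises as character similarity drops."""
    if ref == hyp:
        return 0.0
    return 3.0 * (1.0 - SequenceMatcher(None, ref, hyp).ratio())


def align(reference: list[str], transcript: list[str]) -> list[tuple]:
    """Weighted edit-distance alignment returning (operation, ref_word, spoken_word)."""
    n, m = len(reference), len(transcript)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * SKIP_COST
    for j in range(1, m + 1):
        cost[0][j] = j * SKIP_COST
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + pair_cost(reference[i - 1], transcript[j - 1]),
                cost[i - 1][j] + SKIP_COST,
                cost[i][j - 1] + SKIP_COST,
            )
    # Walk back from the end, re-deriving which move produced each cell.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + pair_cost(
            reference[i - 1], transcript[j - 1]
        ):
            kind = "match" if reference[i - 1] == transcript[j - 1] else "substitution"
            ops.append((kind, reference[i - 1], transcript[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + SKIP_COST:
            ops.append(("deletion", reference[i - 1], None))
            i -= 1
        else:
            ops.append(("insertion", None, transcript[j - 1]))
            j -= 1
    ops.reverse()
    return ops


related = align(["the", "cat", "sat"], ["the", "cap", "dog", "sat"])
unrelated = align(["cat"], ["dog"])
```

In the first call "cat" and "cap" are similar enough to pair as a substitution while "dog" becomes an insertion; in the second, "cat" and "dog" share nothing, so they are split into a deletion and an insertion rather than forced into a mispronunciation.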
Every aligned token receives exactly one label. The labelling logic is explicit, ordered, and easy to audit: assign a base label from the alignment, detect repetitions where the participant said the same word or phrase again, detect self-corrections where a related wrong attempt is followed by the right word, tighten mispronunciation behind a confidence gate, then annotate any unusually long gap.
Either the normalised forms are identical, or the phonetic keys are identical. A confusable-letter difference still counts as correct.
A reference word with no aligned counterpart in the transcript inside the search window. The participant did not read it.
A spoken word that does not correspond to any reference word, and that does not qualify as a repetition or self-correction.
The same spoken word adjacent or one gap apart, where the reference contains it only once. Phrase-level repetitions are also detected.
A failed attempt that is phonetically or visually related to the target, followed by the correct word within two tokens. The recovery is what is counted.
A substitution that is similar enough to be a real attempt, but not close enough to count as correct.
Match → correct. Phonetic match → correct. Deletion → omission. Insertion → insertion. Substitution → tentative mispronunciation, but only if the alignment confidence exceeds zero.
An insertion that repeats the previous spoken word is promoted to repetition. Both single-word and multi-word phrase repetitions are detected. The reference is consulted to avoid mistaking a genuine duplicate for a repetition.
A wrong attempt followed within two tokens by the correct word becomes self-correction, but only if the attempt and the target share enough character overlap or a phonetic key. Short, unrelated pre-reading speech next to a correct word does not count.
The aligner usually splits unrelated words before this step. If a very weak substitution still reaches the rule engine, it is not kept as a mispronunciation. Phonetic-key matches and character similarity at or above 0.8 both revert to correct as likely transcription artefacts. Mispronunciation survives only inside the meaningful middle band.
Gaps longer than the configured threshold are recorded as a property on the following token. They never override a label. They are context the researcher reads alongside the label, not a scoring event.
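The five passes above can be sketched as one function that walks the aligned tokens in order. This is a simplified illustration with several assumptions: the Token fields, the 0.4 lower bound of the middle band, the 0.5 relatedness threshold, the 1000 ms gap threshold, and the choice to relabel rejected substitutions as omissions are all hypothetical. Repetition is reduced to the adjacent single-word case, and self-correction uses character overlap only.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Token:
    op: str                       # match, phonetic, deletion, insertion, substitution
    ref: Optional[str] = None
    hyp: Optional[str] = None
    similarity: float = 0.0
    confidence: float = 1.0
    start_ms: Optional[int] = None
    end_ms: Optional[int] = None
    label: str = ""
    gap_before_ms: Optional[int] = None


BASE = {"match": "correct", "phonetic": "correct",
        "deletion": "omission", "insertion": "insertion"}
WEAK_FALLBACK = "omission"  # assumption: the source does not name this label


def label_tokens(tokens, gap_threshold_ms=1000, low=0.4, high=0.8, related=0.5):
    # Step 1: base label from the alignment operation.
    for t in tokens:
        if t.op == "substitution":
            t.label = "mispronunciation" if t.confidence > 0 else WEAK_FALLBACK
        else:
            t.label = BASE[t.op]
    # Step 2: an insertion that repeats the previous spoken word is a repetition.
    spoken = [t for t in tokens if t.hyp is not None]
    for prev, cur in zip(spoken, spoken[1:]):
        if cur.label == "insertion" and cur.hyp == prev.hyp:
            cur.label = "repetition"
    # Step 3: a related wrong attempt followed within two tokens by the correct word.
    for i, t in enumerate(tokens):
        if t.label in ("insertion", "mispronunciation") and t.hyp:
            for nxt in tokens[i + 1:i + 3]:
                if (nxt.label == "correct" and nxt.ref
                        and SequenceMatcher(None, t.hyp, nxt.ref).ratio() >= related):
                    t.label = "self_correction"
                    break
    # Step 4: confidence gate; only the middle band stays a mispronunciation.
    for t in tokens:
        if t.label == "mispronunciation":
            if t.similarity >= high:
                t.label = "correct"
            elif t.similarity < low:
                t.label = WEAK_FALLBACK
    # Step 5: annotate long gaps on the following token without touching labels.
    prev_end = None
    for t in tokens:
        if t.start_ms is not None and t.end_ms is not None:
            if prev_end is not None and t.start_ms - prev_end > gap_threshold_ms:
                t.gap_before_ms = t.start_ms - prev_end
            prev_end = t.end_ms
    return tokens


labelled = label_tokens([
    Token("match", "the", "the", 1.0, 1.0, 0, 300),
    Token("insertion", None, "cap", 0.0, 1.0, 400, 700),
    Token("match", "cat", "cat", 1.0, 1.0, 800, 1100),
    Token("insertion", None, "cat", 0.0, 1.0, 1200, 1500),
    Token("substitution", "sat", "sap", 0.67, 0.9, 4000, 4300),
    Token("substitution", "mat", "matt", 0.85, 0.9, 4400, 4700),
    Token("deletion", "on"),
])
```

Keeping each pass as its own loop mirrors the ordered, auditable structure described above: every relabel can be traced to exactly one step.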
A wrong label is worse than no label. When the evidence is weak, the system avoids over-calling errors.
CORE PHILOSOPHY · Rule engine
The first layer reports the headline metrics for every trial: words correct per minute, accuracy, coverage, and the timing windows behind them. The second layer stays silent until something looks unusual. Every flag has a configurable threshold, and any flag marks the trial for reviewer attention.
The numerator comes from the labelling cascade. The denominator comes from the duration profile. The dashboard keeps the saved automatic metrics, then recomputes final metrics when a reviewer changes labels or duration.
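As a worked example of how the two inputs combine, the sketch below computes the headline metrics from a list of labels and an effective duration. The formulas for accuracy (correct words over reference words) and coverage (reference words not omitted over reference words) are assumptions; only words correct per minute follows directly from the description above.

```python
def compute_metrics(labels: list[str], reference_word_count: int,
                    effective_duration_ms: int) -> dict[str, float]:
    """Headline metrics from cascade labels (numerator) and duration (denominator)."""
    if reference_word_count <= 0 or effective_duration_ms <= 0:
        raise ValueError("reference_word_count and effective_duration_ms must be positive")
    correct = labels.count("correct")
    omitted = labels.count("omission")
    minutes = effective_duration_ms / 60_000
    return {
        "wcpm": correct / minutes,
        "accuracy": correct / reference_word_count,
        "coverage": (reference_word_count - omitted) / reference_word_count,
    }


# 50 reference words read in 30 seconds: 45 correct, 3 mispronounced, 2 omitted.
labels = ["correct"] * 45 + ["mispronunciation"] * 3 + ["omission"] * 2
metrics = compute_metrics(labels, reference_word_count=50, effective_duration_ms=30_000)
```

Because the function is pure, recomputing after a reviewer override is just a second call with the edited labels or duration.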
A KPI strip at the top reports speed, accuracy, and coverage. The waveform shows the audio with the speech overlay. The token grid lists every word with its label; right-clicking any token opens an override menu, and metrics recompute live as the researcher edits.
speech_onset_latency_ms = 1 840 (threshold 1 500)
longest_gap_ms = 2 310 between tokens 8 and 9
start_delta_ms = 38 · end_delta_ms = 142
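The flag layer can be sketched as a comparison of each measurement against its configured threshold. The example reuses the values shown above; only the 1500 ms onset threshold appears in the source, so the other measurements are left without thresholds here rather than guessed. The function name is hypothetical.

```python
def raise_flags(measurements: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the names of measurements that exceed their configured threshold."""
    return [
        name
        for name, limit in thresholds.items()
        if name in measurements and measurements[name] > limit
    ]


measurements = {
    "speech_onset_latency_ms": 1840,
    "longest_gap_ms": 2310,
    "start_delta_ms": 38,
    "end_delta_ms": 142,
}
flags = raise_flags(measurements, {"speech_onset_latency_ms": 1500})
# Any single flag is enough to mark the trial for reviewer attention.
review_required = bool(flags)
```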