ReadLab · Oral reading fluency, instrumented for research

/01

Participant experience

A reading session, end to end.

Almost everything about a session is configurable: presentation mode, animation unit, font and size, text colour, layout and box width, reveal speed, the number of rounds and the scheduled gap between them, which passages each round draws from, and the rules that compute speed and accuracy. The participant simply opens a link and reads. The flow below is what they actually see.

/ 01 · LOGIN

Code or URL

Participants enter a code, or arrive on a Prolific URL that auto-binds them to a study. The session resumes if interrupted.

P-4F2A

/ 02 · PRE-FLIGHT

Microphone check

A short calibration checks signal level and clipping, and records the noise floor for context. Participants who fail three attempts are blocked from continuing.
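A minimal sketch of what such a calibration check might compute, assuming float samples in the range -1.0 to 1.0. The function names and every threshold below are hypothetical illustrations, not ReadLab's actual values.

```python
import math


def rms_dbfs(samples: list[float]) -> float:
    """RMS level in dB relative to full scale (amplitude 1.0 = 0 dBFS)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def check_calibration(
    speech: list[float],
    silence: list[float],
    min_level_dbfs: float = -40.0,        # hypothetical: quieter speech fails
    clip_amplitude: float = 0.99,         # hypothetical: at or above this counts as clipped
    max_clipped_fraction: float = 0.001,  # hypothetical tolerance
) -> dict:
    """Check signal level and clipping; report the noise floor for context."""
    level = rms_dbfs(speech)
    clipped = sum(1 for s in speech if abs(s) >= clip_amplitude)
    clipped_fraction = clipped / len(speech) if speech else 0.0
    return {
        "level_dbfs": level,
        "clipped_fraction": clipped_fraction,
        "noise_floor_dbfs": rms_dbfs(silence),  # recorded, never used to pass or fail
        "passed": level >= min_level_dbfs and clipped_fraction <= max_clipped_fraction,
    }
```

In this sketch the noise floor is reported but does not affect the pass/fail decision, mirroring the "for context" wording above.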

/ 03 · READING

Reveal or static

Two presentation modes. Reveal animates word, phrase, line, or sentence at a configured pace. Static shows the page at once; no-scroll sessions split long passages into screen-sized pages.

صَبر و دقت پایه

/ 04 · CAPTURE

Voice recording

Audio is recorded during the reading page. Energy is tracked client-side, while the audio and trial package are uploaded for server analysis when the page ends.

/ 05 · ADVANCE

Space to continue

The participant presses space to move on. Multi-day delays between rounds are enforced by the scheduler. The next round simply will not open until the configured interval has passed.
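The scheduling rule can be sketched as a single comparison. This is an illustration only; the function names are hypothetical and the real scheduler may track more state.

```python
from datetime import datetime, timedelta, timezone


def round_opens_at(previous_round_completed: datetime, configured_gap: timedelta) -> datetime:
    """Earliest moment the next round may open."""
    return previous_round_completed + configured_gap


def can_open_round(now: datetime, previous_round_completed: datetime, configured_gap: timedelta) -> bool:
    """True once the configured interval has fully passed."""
    return now >= round_opens_at(previous_round_completed, configured_gap)


# Example: a two-day gap between rounds.
done = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
gap = timedelta(days=2)
print(can_open_round(datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc), done, gap))  # too early
print(can_open_round(datetime(2024, 1, 3, 9, 0, tzinfo=timezone.utc), done, gap))  # interval passed
```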

SPACE

/ MODE A

Reveal animation

Words appear one at a time, or by phrase, line, or sentence, at a speed the researcher configures. Each unit is hidden until its turn, then becomes visible. The renderer keeps its own clock; the timing never depends on speech-to-text.
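A clock-driven reveal of this kind can be sketched as a precomputed schedule. The function below is a hypothetical illustration, assuming each unit is held for as long as its word count takes at the target pace; the actual renderer may pace units differently.

```python
def reveal_schedule(units: list[str], wpm: float, start_ms: float = 0.0) -> list[tuple[str, float]]:
    """Return (unit, reveal_time_ms) pairs for a fixed-pace reveal.

    A unit can be a word, phrase, line, or sentence. Its hold time scales
    with the number of words it contains, so speech-to-text is never consulted.
    """
    if wpm <= 0:
        raise ValueError("wpm must be positive")
    ms_per_word = 60_000.0 / wpm
    schedule = []
    t = start_ms
    for unit in units:
        schedule.append((unit, round(t, 1)))
        t += ms_per_word * max(1, len(unit.split()))
    return schedule


for unit, at_ms in reveal_schedule("Patience and precision in every word".split(), wpm=220):
    print(f"{at_ms:>8.1f} ms  {unit}")
```

At 220 WPM each word is held for 60 000 ÷ 220 ≈ 272.7 ms.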

Patience and precision in every word are the root of literacy.

word-by-word reveal · 220 WPM · auto-pace

/ MODE B

Static, viewport-fit

The passage is shown all at once when it fits. Longer no-scroll passages are split into pages sized for the participant's screen, avoiding scrolling and hidden text while still allowing natural reading.

page 1 / 2

Patience and precision in every word are the root of literacy. Accurate reading takes practice, and every language sets its own bar.

page 2 / 2

Research without precise measurement is impossible. Every datum we collect must be inspectable, adjustable, and traceable.

no-scroll, viewport-fit · auto-paginated

/02

The engine

A ten-stage pipeline from raw audio to a reviewable trial.

When the researcher processes a recording, every stage runs in order and writes durable outputs for review. Each step has a single, narrow job, and the researcher can inspect the trial record and its detailed analysis payload.

Upload

Audio captured in the browser is sent to storage and registered against the trial.

Transcribe

Speech-to-text produces a word sequence with millisecond timestamps for each token.

Detect speech

Voice activity detection finds first speech, last speech, voiced regions, and long pauses.

Normalize

Each word is folded into a canonical form so spelling variants and diacritics do not block matching.

Align

Reference and transcript are matched word-by-word using a weighted edit-distance algorithm.

Label

Each aligned token receives one of six labels through a strict five-step cascade.

Adjudicate

When enabled, selected low-confidence tokens are sent to an AI adjudicator for a second opinion.

Duration

Effective reading duration is computed under the configured profile (exposure-based or speech-based).

Flag

Eight quality checks raise issues; flagged trials carry a review-required signal in the dashboard.

Review

The researcher inspects the trial, overrides any decision, and locks the result.

SIGNAL

A fast yes/no, e.g. WCPM, accuracy, flag count.

DECISION

A drill-down for context, e.g. waveform, gap inspector, label distribution.

ACTION

A full review with overrides, e.g. relabel a token, exclude a gap, retry the trial.

/03

Parallel tracks

Two parallel tracks. Same audio, different questions.

Speech-to-text answers what was said. Voice activity detection answers when it was said. The two tracks are computed independently, then their outputs are checked against each other before the metrics are derived.

Speech-to-text · word-level transcription with timestamps

The transcript is not produced blindly. Before the audio is sent to the engine, ReadLab analyses the reference passage, filters common Persian stopwords, and favours custom, rare, long, or hard-to-transcribe terms as decoder hints. Researchers can also provide their own keyterms in the experiment config. The point is that uncommon vocabulary, names, and domain-specific terms have a better chance of surviving transcription instead of being normalised away by the model's prior.
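A minimal sketch of how such hints might be chosen, assuming the engine accepts a plain list of keyterms. The function name, the length cut-off, and the limit are hypothetical. Rarity is not modelled here; word length stands in as a rough proxy.

```python
import string

# ASCII punctuation plus Persian comma, semicolon, question mark and guillemets.
PUNCTUATION = string.punctuation + "\u060c\u061b\u061f\u00ab\u00bb"


def select_keyterms(passage, stopwords, custom_terms=(), min_length=6, limit=20):
    """Pick decoder hints: researcher terms first, then long non-stopwords."""
    chosen, seen = [], set()

    def add(term):
        key = term.lower()
        if key and key not in seen:
            seen.add(key)
            chosen.append(term)

    for term in custom_terms:  # researcher-provided terms always go first
        add(term)

    candidates = [w.strip(PUNCTUATION) for w in passage.split()]
    candidates = [w for w in candidates if len(w) >= min_length and w.lower() not in stopwords]
    for word in sorted(candidates, key=len, reverse=True):  # longest first, stable
        add(word)
    return chosen[:limit]


print(select_keyterms(
    "The smell of petrichor, and the star Aldebaran.",
    stopwords={"the", "of", "and"},
    custom_terms=["susurrus"],
))
```

Splitting on whitespace rather than on word characters keeps Persian words that contain a zero-width non-joiner in one piece.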

patience · 412 → 720 ms
and · 730 → 810 ms
precision · 820 → 1180 ms
in · 1190 → 1290 ms
every · 1300 → 1410 ms

Filtered stopwords: و از که · Favoured keyterms: anastomosis, petrichor, susurrus, Aldebaran, forsooth

Voice activity detection · speech / silence boundaries

Voice activity detection is a separate analysis that ignores the words entirely. It asks where speech is likely happening and where it is not, using voice structure rather than raw volume alone. Non-speech noise is usually kept out of the timing window, while speech-like background audio can still need review. Without this track, pre-reading silence, post-reading silence, and long pauses inside a passage would all inflate reading time.

onset 412 ms · offset 6 814 ms · speech 5.84 s · longest gap 620 ms

Timeline legend: speech · pause · first speech · last speech
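The boundary figures this track reports can be derived from a list of voiced regions. The sketch below assumes the detector returns non-overlapping (start_ms, end_ms) pairs; the function name and the long-pause threshold are hypothetical.

```python
def summarize_voiced_regions(regions, long_pause_ms=1000):
    """Summarize non-overlapping (start_ms, end_ms) voiced regions."""
    if not regions:
        return None
    ordered = sorted(regions)
    # Silence between each region and the next one.
    gaps = [(prev_end, next_start) for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]
    return {
        "first_speech_ms": ordered[0][0],
        "last_speech_ms": ordered[-1][1],
        "voiced_ms": sum(end - start for start, end in ordered),
        "longest_gap_ms": max((b - a for a, b in gaps), default=0),
        "long_pauses": [g for g in gaps if g[1] - g[0] >= long_pause_ms],
    }


# Illustrative input: two voiced regions separated by a 620 ms pause.
print(summarize_voiced_regions([(412, 3000), (3620, 6814)], long_pause_ms=500))
```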

/04

Normalization & alignment

Same sound, different spelling. The engine knows.

ReadLab folds each word into matching forms and runs alignment on a weighted cost ladder. English uses the same exact-match and character-similarity path shown below; Persian currently has the deeper phonetic tuning, where confusable letters such as ص ث س or ز ذ ض ظ can be treated as the same sound instead of a mispronunciation.

Normalization output

original → normalized → internal key
Running! → running → running
مي‌روم → می‌روم → مYرVم
صبر → صبر → Sبر
ثبر → ثبر → Sبر

English: the key equals the normalized word.
Persian: Arabic/Persian variants normalize first; confusable letters then share one internal key.
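A minimal sketch of the two folding steps, written to reproduce the examples above. The mapping tables are illustrative assumptions, not ReadLab's actual tables: the symbols S, Y and V are taken from the normalization output shown here, while Z for the ز ذ ض ظ group is a guess.

```python
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner, used inside Persian words

# Arabic code points folded to their Persian equivalents.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0649": "\u06cc",  # alef maksura -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf
})

# Letters that share one key symbol (Z is an assumption).
KEY_GROUPS = {
    "S": "\u0635\u062b\u0633",        # sad, theh, seen
    "Z": "\u0632\u0630\u0636\u0638",  # zain, thal, dad, zah
    "Y": "\u06cc",                    # Persian yeh
    "V": "\u0648",                    # waw
}
KEY_OF = {letter: symbol for symbol, letters in KEY_GROUPS.items() for letter in letters}


def normalize(word: str) -> str:
    """Fold case and Arabic variants; drop diacritics, punctuation and symbols."""
    folded = unicodedata.normalize("NFC", word).translate(ARABIC_TO_PERSIAN).lower()
    kept = [ch for ch in folded if unicodedata.category(ch)[0] not in "MPS"]
    return "".join(kept).strip(ZWNJ)


def phonetic_key(word: str) -> str:
    """Map confusable letters to a shared symbol and drop the ZWNJ."""
    return "".join(KEY_OF.get(ch, ch) for ch in normalize(word) if ch != ZWNJ)


print(normalize("Running!"), phonetic_key("Running!"))
print(phonetic_key("\u0635\u0628\u0631") == phonetic_key("\u062b\u0628\u0631"))  # صبر vs ثبر
```

For English input the key falls through unchanged, which matches the note that the English key equals the normalized word.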

Weighted alignment

Each pair below shows a matched token and the cost the engine paid to make that match.

reference: the cat running mat
transcript: the cat runing truck

the ↔ the · cost 0.00 · exact match
running ↔ runing · cost 0.30 · high char similarity
mat ↔ truck · cost 2.10 · prefer del + ins

Substitution cost ladder

Six tiers of similarity, one DP table

Every possible word-pair has a price. Identity is free. In Persian, a confusable-letter difference can be almost free. As similarity drops, the price rises until it exceeds the cost of skipping the reference word and inserting the transcript word as a separate event, which stops the system from inventing mispronunciations between unrelated words.

0.00 · Exact match · normalized strings identical · cat ↔ cat
0.15 · Phonetic match · internal keys identical, strongest in Persian today · صبر ↔ ثبر
0.30 · High char similarity · charLev ≥ 0.7, likely typo or affix variant · running ↔ runing
0.50 · Medium similarity · charLev ≥ 0.5, shared letters, similar order · kitten ↔ sitting
0.80 · Low similarity · charLev ≥ 0.3, faintly resembles, probably not a real attempt · night ↔ nite
2.00 · Delete + insert · implicit breakeven, cost of treating the pair as two separate events
·· algorithm switches above this line ··
2.10 · Unrelated · charLev < 0.3, intentionally above 2.00 so DP rejects substitution · cat ↔ truck

Above the breakeven of 2.00, the algorithm always prefers to delete the reference word and insert the transcript word as two separate events rather than force a substitution. That single choice is what lets the labels downstream stay honest: a reference word that was skipped reads as omission, the unrelated transcript word reads as insertion, and nothing pretends to be a mispronunciation that wasn't.
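The ladder and the breakeven can be sketched as a small dynamic-programming aligner. This is an illustration under stated assumptions, not ReadLab's implementation: charLev is taken to be 1 minus the Levenshtein distance divided by the longer word's length, a single deletion or insertion is assumed to cost 1.00 (so that a pair costs the 2.00 breakeven), and the `key` parameter is a hypothetical hook for a phonetic-key function.

```python
GAP_COST = 1.0  # assumed: one deletion or one insertion; a delete + insert pair costs 2.00


def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def char_similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def substitution_cost(ref: str, hyp: str, key=lambda w: w) -> float:
    """Price a word pair on the six-tier ladder."""
    if ref == hyp:
        return 0.00
    if key(ref) == key(hyp):
        return 0.15
    sim = char_similarity(ref, hyp)
    for floor, cost in ((0.7, 0.30), (0.5, 0.50), (0.3, 0.80)):
        if sim >= floor:
            return cost
    return 2.10  # above the breakeven, so the DP never chooses it


def align(ref: list[str], hyp: list[str], key=lambda w: w):
    """Return (total_cost, ops) where ops are (kind, ref_word, hyp_word)."""
    n, m = len(ref), len(hyp)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = i * GAP_COST, "delete"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = j * GAP_COST, "insert"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (cost[i - 1][j - 1] + substitution_cost(ref[i - 1], hyp[j - 1], key), "pair"),
                (cost[i - 1][j] + GAP_COST, "delete"),
                (cost[i][j - 1] + GAP_COST, "insert"),
            ]
            cost[i][j], back[i][j] = min(options, key=lambda o: o[0])
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # walk the back-pointers from the end
        move = back[i][j]
        if move == "pair":
            ops.append(("pair", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif move == "delete":
            ops.append(("delete", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, hyp[j - 1]))
            j -= 1
    return cost[n][m], ops[::-1]


total, ops = align("the cat running mat".split(), "the cat runing truck".split())
for op in ops:
    print(op)
```

On this input the sketch pairs running with runing and splits mat and truck into a deletion and an insertion, because 1.00 + 1.00 is cheaper than the 2.10 substitution.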

/05

Rule engine

Six labels. A strict cascade. Conservative defaults.

Every aligned token receives exactly one label. The labelling logic is explicit, ordered, and easy to audit: assign a base label from the alignment, detect repetitions where the participant said the same word or phrase again, detect self-corrections where a related wrong attempt is followed by the right word, tighten mispronunciation behind a confidence gate, then annotate any unusually long gap.

CORRECT

Read accurately

Either the normalised forms are identical, or the phonetic keys are identical. A confusable-letter difference still counts as correct.

The cat sat on the mat.

ORF status: counted as correct

OMISSION

Skipped a word

A reference word with no aligned counterpart in the transcript inside the search window. The participant did not read it.

The cat sat on the mat.

ORF status: excluded from correct count

INSERTION

Added a word

A spoken word that does not correspond to any reference word, and that does not qualify as a repetition or self-correction.

The cat quickly sat on the mat.

ORF status: not counted

REPETITION

Said it twice

The same spoken word adjacent or one gap apart, where the reference contains it only once. Phrase-level repetitions are also detected.

The cat sat sat on the mat.

ORF status: first instance counted

SELF-CORRECTION

Tried, then corrected

A failed attempt that is phonetically or visually related to the target, followed by the correct word within two tokens. The recovery is what is counted.

The cap... cat sat on the mat.

ORF status: final attempt counted

MISPRONUNCIATION

Wrong sounds

A substitution that is similar enough to be a real attempt, but not close enough to count as correct.

The cat sat on the mast.

ORF status: excluded from correct count

The five-step cascade

Each step inspects, refines, or defers

Base labels from alignment

Match → correct. Phonetic match → correct. Deletion → omission. Insertion → insertion. Substitution → tentative mispronunciation, but only if the alignment confidence exceeds zero.

Detect repetitions

An insertion that repeats the previous spoken word is promoted to repetition. Both single-word and multi-word phrase repetitions are detected. The reference is consulted to avoid mistaking a genuine duplicate for a repetition.

Detect self-corrections

A wrong attempt followed within two tokens by the correct word becomes self-correction, but only if the attempt and the target share enough character overlap or a phonetic key. Short, unrelated pre-reading speech next to a correct word does not count.

Tighten mispronunciation

The aligner usually splits unrelated words before this step. If a very weak substitution still reaches the rule engine, it is not kept as a mispronunciation. Phonetic-key matches and character similarity at or above 0.8 both revert to correct as likely transcription artefacts. Mispronunciation survives only inside the meaningful middle band.

character similarity < 0.30: not mispron. · 0.30 to 0.80: mispron. · ≥ 0.80: correct (artefact)

Annotate pauses

Gaps longer than the configured threshold are recorded as a property on the following token. They never override a label. They are context the researcher reads alongside the label, not a scoring event.
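Two of the cascade steps can be sketched in a few lines. Everything here is a simplified illustration: the function names and token layout are hypothetical, the text does not say which label replaces a very weak substitution (a placeholder string is used), and the repetition check looks at both neighbours because an aligner may mark either copy of a repeated word as the insertion.

```python
def tighten_substitution(similarity: float, keys_match: bool) -> str:
    """Apply the confidence gate to a tentative mispronunciation."""
    if keys_match or similarity >= 0.8:
        return "correct"               # likely transcription artefact
    if similarity < 0.3:
        return "not_mispronunciation"  # placeholder: replacement label is not specified
    return "mispronunciation"          # the meaningful middle band


def promote_repetitions(spoken_tokens: list[dict]) -> list[dict]:
    """Promote an insertion that duplicates an adjacent spoken word.

    spoken_tokens holds only tokens that were actually spoken, in order,
    each with a "word" and a "label". Phrase repetitions are not handled.
    """
    words = [t["word"] for t in spoken_tokens]
    result = []
    for idx, token in enumerate(spoken_tokens):
        token = dict(token)  # leave the input untouched
        neighbours = words[max(0, idx - 1):idx] + words[idx + 1:idx + 2]
        if token["label"] == "insertion" and token["word"] in neighbours:
            token["label"] = "repetition"
        result.append(token)
    return result


tokens = [
    {"word": "the", "label": "correct"},
    {"word": "cat", "label": "correct"},
    {"word": "sat", "label": "correct"},
    {"word": "sat", "label": "insertion"},
    {"word": "on", "label": "correct"},
]
print([t["label"] for t in promote_repetitions(tokens)])
```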

"A wrong label is worse than no label. When the evidence is weak, the system avoids over-calling errors."

CORE PHILOSOPHY · Rule engine

/06

Metrics & flags

Two layers of measurement.

The first layer reports the headline metrics for every trial: words correct per minute, accuracy, coverage, and the timing windows behind them. The second layer stays silent until something looks unusual. Every flag has a configurable threshold, and any flag marks the trial for reviewer attention.

/ TIER ONE · REPORTED ON EVERY TRIAL

Reading-speed formula

WPM (words per minute) = reference word count ÷ (effective duration in seconds ÷ 60)

WCPM (words correct per minute, PRIMARY) = correct word count ÷ (effective duration in seconds ÷ 60)

The numerator comes from the labelling cascade. The denominator comes from the duration profile. The dashboard keeps the saved automatic metrics, then recomputes final metrics when a reviewer changes labels or duration.
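As a worked example of the two formulas, here is a small sketch. The word counts are hypothetical; only the arithmetic is taken from the formulas above.

```python
def words_per_minute(word_count: int, effective_duration_s: float) -> float:
    """Word count divided by the effective duration expressed in minutes."""
    if effective_duration_s <= 0:
        raise ValueError("effective duration must be positive")
    return word_count / (effective_duration_s / 60)


# Hypothetical trial: 20 reference words, 19 labelled correct, read in 6.40 s.
reference_words, correct_words, duration_s = 20, 19, 6.40
print(f"WPM  = {words_per_minute(reference_words, duration_s):.1f}")
print(f"WCPM = {words_per_minute(correct_words, duration_s):.1f}")
```

Because both metrics share one denominator, a reviewer override of the duration moves WPM and WCPM together, while a label override changes only WCPM.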

Effective duration · profile A vs B

A · exposure: page display → space

B · speech: first speech → last speech

page 0 ms · first speech 412 ms · last speech 6 814 ms · space 7 240 ms

Profile A · 7.24 s

Total exposure time. Simple and robust, but inflated by silence before reading begins and by hesitation before the participant presses space.

Profile B (default) · 6.40 s

Voice-activity boundaries only. Closer to standard ORF timing practice. Removes pre- and post-reading silence. The researcher can override the start, end, or excluded gaps per trial.
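The two profiles differ only in which timestamps bound the window. The sketch below reproduces the timeline figures above; the class, the field names, and the excluded-gap handling are hypothetical, and excluded gaps are assumed not to overlap one another.

```python
from dataclasses import dataclass


@dataclass
class TrialTimestamps:
    page_shown_ms: int
    first_speech_ms: int
    last_speech_ms: int
    space_pressed_ms: int


def effective_duration_ms(t: TrialTimestamps, profile: str = "speech", excluded_gaps=()) -> int:
    """Duration under profile 'exposure' (A) or 'speech' (B), minus excluded gaps."""
    if profile == "exposure":
        start, end = t.page_shown_ms, t.space_pressed_ms
    elif profile == "speech":
        start, end = t.first_speech_ms, t.last_speech_ms
    else:
        raise ValueError(f"unknown profile: {profile}")
    # Only the part of each excluded gap that lies inside the window is removed.
    excluded = sum(max(0, min(g_end, end) - max(g_start, start)) for g_start, g_end in excluded_gaps)
    return (end - start) - excluded


t = TrialTimestamps(page_shown_ms=0, first_speech_ms=412, last_speech_ms=6814, space_pressed_ms=7240)
print(effective_duration_ms(t, "exposure") / 1000)  # Profile A
print(effective_duration_ms(t, "speech") / 1000)    # Profile B
```

With these timestamps Profile A spans 7 240 ms and Profile B spans 6 814 − 412 = 6 402 ms, which rounds to the 6.40 s shown above.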

/ TIER TWO · RAISES A FLAG WHEN A THRESHOLD IS CROSSED

Every threshold is set in the experiment CSV. The defaults shown below are the values shipped with the platform; researchers raise or lower them per study. Any single crossed threshold sets reviewRequired = true so the dashboard can surface the trial for closer review.

High onset latency · speech_onset_latency_ms > 5 000 ms

What it asks. How long after the page appeared did the participant actually start speaking?

Why it matters. A long latency suggests the participant did not yet understand the task, was confused by the text, or was distracted. Severity escalates when the value is more than twice the configured threshold.

What you do. Listen to the opening of the recording. Either accept the latency, or override the duration start to ignore the dead air.

High post-speech lag · post_speech_lag_ms > 5 000 ms

What it asks. How long between the participant's last spoken word and pressing space to advance?

Why it matters. Long lag inflates Profile A duration. It often means the participant finished reading and then waited several seconds before advancing, which is behaviourally interesting but should not depress their WCPM.

What you do. Profile B already excludes this lag. If using Profile A, consider an override.

Long internal gap · longest_gap_ms > 10 000 ms

What it asks. What is the longest silence between two adjacent spoken words?

Why it matters. A multi-second pause inside a passage is rarely natural reading behaviour. The participant may have lost their place, encountered an unfamiliar word, or been interrupted.

What you do. Inspect the gap in the waveform. Researchers can mark a specific gap as excluded so it does not affect duration or speed.

Low reference coverage · reference_coverage_pct < 50 %

What it asks. What fraction of the reference passage was actually read?

Why it matters. A trial where less than half the passage was read should not produce a publishable WCPM. The participant may have stopped early, skipped a section, or misunderstood the instructions.

What you do. Decide whether to exclude the trial, retry it, or treat it as a partial reading.

High off-reference ratio · off_reference_pct > 40 %

What it asks. What fraction of the participant's spoken words could not be aligned to any reference word?

Why it matters. Lots of off-reference speech usually means the participant read a different passage, paraphrased, or carried on a side conversation during the recording. It also catches transcription drift.

What you do. Confirm the participant was on the right passage; if so, decide whether to label or exclude.

Large start delta · |start_delta_ms| > 3 000 ms

What it asks. Is there a large gap between when voice-activity says speech started and when the first reference word was aligned?

Why it matters. A big positive delta means the participant said something off-reference before reading. A big negative delta usually points to a boundary error in either detector. Either way, the timing window cannot be trusted without a look.

What you do. Listen to the opening seconds. If the participant is reading from the start, override; if they are warming up off-script, accept the delta.

Large end delta · |end_delta_ms| > 3 000 ms

What it asks. Is there a large gap between when voice-activity says speech ended and when the last reference word was aligned?

Why it matters. Off-reference speech after the passage, a long trailing word that voice-activity didn't catch, or a mismatched final boundary all show up here.

What you do. Inspect the closing seconds and decide whether the duration's end source is right.

Low speech percentage · speech_pct < 30 %

What it asks. What fraction of the trial duration was actually speech, as opposed to silence?

Why it matters. Very low speech-percentage trials are dominated by silence and produce noisy reading-speed numbers. They are typically partial reads, very slow reads, or recordings with audio problems.

What you do. Confirm the recording is intact and decide whether the trial should count.
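The eight checks reduce to one comparison per metric. The sketch below uses the default thresholds listed above; the rule names, the dictionary layout, and the function names are hypothetical, and severity escalation is left out.

```python
# (rule, limit) per metric, using the shipped defaults listed above.
DEFAULT_THRESHOLDS = {
    "speech_onset_latency_ms": ("above", 5000),
    "post_speech_lag_ms": ("above", 5000),
    "longest_gap_ms": ("above", 10000),
    "reference_coverage_pct": ("below", 50),
    "off_reference_pct": ("above", 40),
    "start_delta_ms": ("abs_above", 3000),
    "end_delta_ms": ("abs_above", 3000),
    "speech_pct": ("below", 30),
}


def crossed(value: float, rule: str, limit: float) -> bool:
    if rule == "above":
        return value > limit
    if rule == "below":
        return value < limit
    if rule == "abs_above":
        return abs(value) > limit
    raise ValueError(f"unknown rule: {rule}")


def evaluate_flags(metrics: dict, thresholds: dict = DEFAULT_THRESHOLDS) -> dict:
    """Return the crossed flags; any single flag sets reviewRequired."""
    flags = [
        name
        for name, (rule, limit) in thresholds.items()
        if name in metrics and crossed(metrics[name], rule, limit)
    ]
    return {"flags": flags, "reviewRequired": bool(flags)}


# A study that lowers two thresholds, as a researcher might in the experiment config.
study = {**DEFAULT_THRESHOLDS,
         "speech_onset_latency_ms": ("above", 1500),
         "longest_gap_ms": ("above", 2000)}
metrics = {"speech_onset_latency_ms": 1840, "longest_gap_ms": 2310,
           "start_delta_ms": 38, "end_delta_ms": 142}
print(evaluate_flags(metrics, study))
```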

/07

Researcher dashboard

Three layers of depth, always traceable.

A KPI strip at the top reports speed, accuracy, and coverage. The waveform shows the audio with the speech overlay. The token grid lists every word with its label; right-clicking any token opens an override menu, and metrics recompute live as the researcher edits.

ReadLab

Trials Participants Export Settings

P-4F2A · trial 03 · review

WCPM · 187 w/min · ▲ +6 vs round 1

Accuracy · 94.2% · ▲ +1.4

Coverage · 100% · full passage

Flags · needs review

Audio · with voice-activity overlay

Tokens · right-click to override

The cat sat on the mat while while the dog was slee… sleeping in the sun all afternoon

Active flags

High onset latency

speech_onset_latency_ms = 1 840 (threshold 1 500)

Long internal gap

longest_gap_ms = 2 310 between tokens 8 and 9

Cross-check passed

start_delta_ms = 38 · end_delta_ms = 142

Review state

committed → in review → review passed → locked

Researcher annotating an ORF transcript by hand

The same act, without instruments. ReadLab brings the rigor of in-room ORF annotation to remote, browser-based studies.

How do we measure reading when readers are online?

Every parameter configurable. Every label overridable. Every metric traceable.