A Sydney consultant I work with runs three to four Zoom calls a day with international clients. Each call ends with action items she captures in a new document while the conversation is still fresh in her head. The problem isn't catching the big decisions; it's the specifics. The exact phrasing someone used about scope. The date a counterparty offered before they backtracked. The technical term that turned out to be the key to the problem.

Her current solution is to record every call, watch it back at 1.5× speed, and capture the missing detail. That works, but it adds roughly two-thirds again to the time spent per call. The five-minute fix is to transcribe the Zoom recording instead, which returns a searchable, copyable transcript in the time it takes to make a coffee.

This guide walks through the practical mechanics: which Zoom recording option to use, how to get the audio out, what to do with the M4A file Zoom produces, and the privacy considerations that matter if you're operating an Australian business under the Privacy Act.

Step 1: Record the call properly

Zoom offers two recording options that affect what you can transcribe later:

Local recording (the default for free plans, optional for paid). The host's machine captures audio + video to a folder on disk. Comes out as separate audio (M4A) and video (MP4) files. Available immediately when the call ends.

Cloud recording (paid plans only). Recording happens on Zoom's servers; you download it from your Zoom account after processing. Adds 15–30 minutes of processing time but produces speaker-separated audio if you enable that option.

For transcription purposes, both work. The differences worth knowing:

Local recordings are available immediately but have lower-quality audio (the encoder runs on your machine in parallel with the call itself). Fine for transcription, not great for audio republishing.
Cloud recordings with the "separate audio file for each participant" option produce per-speaker audio files. Whisper-based transcription still doesn't auto-label speakers, but if you transcribe each file separately you can manually label after.

Pick local recording if: you want the transcript as fast as possible and don't care about speaker labels.

Pick cloud recording with separate-speaker option if: you need to know who said what for legal, journalistic, or panel-style content.
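If you go the separate-speaker route, the manual labelling step can be made mostly mechanical. The following is a minimal sketch, assuming each per-participant file has already been transcribed into a list of timestamped segments and that all files share the same start time. The `Segment` shape, the speaker names, and `merge_by_time` are illustrative and not any tool's actual output format.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: float  # seconds from the start of the recording (assumed shared)
    text: str


def merge_by_time(per_speaker: dict[str, list[Segment]]) -> list[str]:
    """Interleave per-speaker segments into one speaker-labelled transcript."""
    labelled = [
        (seg.start, speaker, seg.text)
        for speaker, segments in per_speaker.items()
        for seg in segments
    ]
    labelled.sort(key=lambda item: item[0])  # order by start time only
    return [f"[{speaker}] {text}" for _, speaker, text in labelled]


# Hypothetical transcripts of two per-participant audio files.
transcripts = {
    "Host": [Segment(0.0, "Thanks for joining."), Segment(9.5, "Friday works.")],
    "Client": [Segment(4.2, "Can we move the deadline?")],
}
for line in merge_by_time(transcripts):
    print(line)
```

You assign the speaker name once per file rather than once per line, which is where the time saving comes from.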

Critical setting most people miss

In Zoom's recording settings, there's a checkbox "Record a separate audio file for each participant". It defaults to off. Tick it before your first important call. Once you've recorded without it, there's no way to back-derive per-speaker audio.

Also worth toggling on: "Optimise for 3rd party video editor", which prevents Zoom from re-encoding the audio in a lossy way that hurts transcription accuracy on softer voices.

Step 2: Locate the audio file

After the call, Zoom drops files in Documents/Zoom/YYYY-MM-DD HH.MM.SS Meeting Name/ on Mac and Windows. Inside you'll find:

audio_only.m4a — the mixed audio (what you want)
video1234567890.mp4 — full video
playback.m3u — Zoom's playback playlist
(if separate-speaker is on) audio0.m4a, audio1.m4a, etc. — per-participant

The M4A file is what you'll upload to your transcription service. It's typically 30–80 MB for a one-hour call, depending mainly on the bitrate and on whether the audio is mono or stereo.
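If you record several calls a day, hunting through dated folders gets tedious. Here is a small sketch that picks the most recently modified M4A one level below the Zoom folder. It assumes the default `Documents/Zoom` location described above; `newest_audio` is a hypothetical helper name, and the path needs adjusting if you changed the recording location in Zoom's settings.

```python
from pathlib import Path
from typing import Optional


def newest_audio(zoom_root: Path) -> Optional[Path]:
    """Return the most recently modified .m4a one folder below zoom_root."""
    if not zoom_root.is_dir():
        return None
    candidates = list(zoom_root.glob("*/*.m4a"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # Assumed default recording location; change it if yours differs.
    print(newest_audio(Path.home() / "Documents" / "Zoom"))
```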

A 60-minute Zoom recording at 64 kbps mono comes out about 28 MB. At Zoom's default 128 kbps stereo, more like 55 MB. Both are well under the 250 MB cap on speechtotext.au.
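Those sizes come straight from bitrate times duration. The sketch below covers only the raw audio stream at a constant bitrate, so real files will differ slightly because of container overhead and variable-bitrate encoding. The function names are illustrative.

```python
def audio_size_mb(minutes: float, kbps: float) -> float:
    """Approximate size in MB (10^6 bytes) of constant-bitrate audio."""
    bits = kbps * 1000 * minutes * 60
    return bits / 8 / 1_000_000


def max_hours(cap_mb: float, kbps: float) -> float:
    """Hours of audio that fit under a size cap at a given bitrate."""
    return cap_mb / audio_size_mb(60, kbps)


print(round(audio_size_mb(60, 64), 1))   # 28.8
print(round(audio_size_mb(60, 128), 1))  # 57.6
print(round(max_hours(250, 64), 1))      # 8.7
print(round(max_hours(250, 128), 1))     # 4.3
```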

Step 3: Transcribe

You have three real options here. The trade-offs:

Option A — Zoom's built-in transcription

Zoom's cloud recordings include automatic transcription on paid Business+ plans. It costs nothing extra; the catch is that Zoom's transcription quality on Australian English is mediocre, particularly when participants have strong accents or technical terminology comes up.

We ran Zoom's transcription on the same audio we used for our broader benchmark. Zoom's word error rate landed at 12–14% on broad Australian accents, vs Whisper-large-v3's 6.4%. For a one-hour call with a thick accent, at roughly 100 spoken words per minute, that's the difference between about 6 wrong words per minute (Whisper) and about 13 (Zoom).
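The per-minute figures are just the word error rate multiplied by the speaking rate. The sketch below assumes about 100 words per minute, which is the rate the figures above imply. Faster conversational speech scales both numbers up proportionally.

```python
def wrong_words_per_minute(wer: float, words_per_minute: float = 100) -> float:
    """Expected wrong words per minute for a given word error rate (0-1)."""
    return wer * words_per_minute


whisper = wrong_words_per_minute(0.064)  # 6.4% WER
zoom = wrong_words_per_minute(0.13)      # midpoint of the 12-14% range

print(round(whisper, 1))   # 6.4
print(round(zoom, 1))      # 13.0
print(round(zoom * 60))    # 780 wrong words over a one-hour call
print(round(whisper * 60)) # 384 wrong words over a one-hour call
```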

If you only have a handful of calls and accuracy isn't critical, the built-in option is free, fast, and zero-friction. For anything where you'll be acting on the content — sales-call notes, technical discussions, legal calls — the accuracy gap is enough to push you to a dedicated tool.

Option B — Upload to speechtotext.au

Drop the M4A file at [speechtotext.au](/). The free tier handles up to 15 minutes per file; Pro handles 3-hour files. A 60-minute call comes back in about 90 seconds. The transcript is searchable, copyable, and exportable as TXT or SRT.

Two practical notes:

Browser-side compression. The site automatically transcodes your 55 MB M4A to ~3 MB Opus in your browser before uploading. The compression happens via ffmpeg.wasm and takes maybe 10 seconds for a 60-minute file. The re-encode is lossy, but not in a way that should affect transcription accuracy: Whisper operates at 16 kHz mono internally regardless of input.
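If you want to do a similar downsample yourself before uploading on a slow connection, a desktop ffmpeg can produce a comparable file. The sketch below only builds the command. The 16 kbps bitrate and the other settings are assumptions, not the site's actual ffmpeg.wasm configuration, and `opus_command` is a hypothetical helper.

```python
import shlex
from pathlib import Path


def opus_command(src: Path, bitrate_kbps: int = 16) -> list[str]:
    """Build an ffmpeg command that converts src to 16 kHz mono Opus."""
    dst = src.with_suffix(".opus")
    return [
        "ffmpeg", "-i", str(src),
        "-vn",                        # drop any video stream
        "-ac", "1",                   # mix down to mono
        "-ar", "16000",               # resample to 16 kHz
        "-c:a", "libopus",            # encode with the Opus codec
        "-b:a", f"{bitrate_kbps}k",   # assumed target bitrate
        str(dst),
    ]


cmd = opus_command(Path("audio_only.m4a"))
print(shlex.join(cmd))
# To run it (requires ffmpeg on your PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```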

Privacy by default. Audio passes through memory only — no disk storage on the server. The transcript itself is saved to your account (if signed in) so you can re-open it later, or deleted instantly (if anonymous).

Option C — Other tools

Otter, Descript, Sonix, AssemblyAI: all run on modern neural speech-recognition models in the same class as Whisper, and the quality is broadly comparable. Differentiation is on features: Otter for live transcription during the call, Descript for editing audio by editing transcript text, AssemblyAI for developers who want raw API access.

For Australian users specifically, speechtotext.au has two practical advantages: data residency (audio processed in Australia, transcripts in EU/AU) and AUD billing (no foreign exchange surprises on monthly invoices).

Step 4: Use the transcript

The transcript by itself is just text. What makes the workflow valuable is what you do with it:

Action item extraction. Search the transcript for words like "I'll", "we'll", "by Friday", "next week". Action items rarely hide in unexpected places.

Quote pulls. For sales follow-ups, the strongest hook is often a quote from the prospect's own words back to them. A searchable transcript makes this trivial.

Translation prep. If the call needs to be translated, having the English transcript first saves an interpreter 70% of the work.

Knowledge base contribution. For internal product or engineering calls, pasting the transcript into a Notion / Confluence page creates a search target for future "didn't we discuss this?" questions.

Search across history. Once you've been doing this for a few months, the most useful feature is grepping across all your call transcripts for a specific concept. "What was that vendor's name we talked about three months ago?"
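The action-item search described above is easy to script once the transcript is plain text. This is a rough sketch using the trigger phrases from the list; the pattern and the `find_action_lines` helper are illustrative, and you would extend the phrases to match how your own clients talk.

```python
import re

# Trigger phrases from the article; accepts straight or curly apostrophes.
ACTION_PATTERN = re.compile(
    r"\b(?:I['’]ll|we['’]ll|by (?:Mon|Tues|Wednes|Thurs|Fri)day|next week)\b",
    re.IGNORECASE,
)


def find_action_lines(transcript: str) -> list[str]:
    """Return transcript lines that contain a likely action-item phrase."""
    return [
        line.strip()
        for line in transcript.splitlines()
        if ACTION_PATTERN.search(line)
    ]


sample = """Thanks everyone for joining.
I'll send the revised scope by Friday.
The weather has been great.
We'll review pricing next week."""

for line in find_action_lines(sample):
    print(line)
```

Treat the output as a shortlist to review, not a finished task list: a phrase match finds candidates, and you still decide which ones are real commitments.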

speechtotext.au's history view lets you search across all your saved transcripts. We added that early because most users discover the search use case organically and start using it more than the actual transcription feature.
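If you also keep exported TXT transcripts in a local folder, the same cross-history search works offline. The sketch below assumes one UTF-8 `.txt` file per call in a single folder; the folder layout and the `search_transcripts` helper are hypothetical.

```python
from pathlib import Path


def search_transcripts(folder: Path, term: str) -> list[tuple[str, int, str]]:
    """Case-insensitive search; returns (file name, line number, line)."""
    needle = term.casefold()
    hits = []
    for path in sorted(folder.glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if needle in line.casefold():
                hits.append((path.name, number, line.strip()))
    return hits


if __name__ == "__main__":
    # Assumed folder of exported transcripts; change to wherever you keep them.
    for name, number, line in search_transcripts(Path("call-notes"), "vendor"):
        print(f"{name}:{number}: {line}")
```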

Privacy: where does the audio go

For Australian businesses, the question that matters most:

If your Zoom call discussed client information, commercial-in-confidence material, or anything covered by your data-handling policies, you need to know what happens to the audio file you upload.

Zoom itself: Zoom's privacy policy permits use of customer content for "service operation and improvement" (clause 10.3 of the November 2024 Terms). There's an opt-out flag in admin settings for enterprise plans. Australian Zoom calls can also be hosted on Australian data centres if you have a Business+ plan and enable AU residency.

speechtotext.au: Audio passes through memory only and is discarded after transcription. No on-disk storage on the server. Transcripts are saved per-user (or returned inline for anonymous use) and can be deleted from your account at any time. We're an Australian company subject to the Australian Privacy Principles (APPs). The privacy page lays it out in plain English.

Otter, Descript, Rev, etc.: Each has its own data handling policy. The two questions worth asking any service:

Is the audio used to train future versions of the model?
Where is it stored geographically, and for how long?

Either a "yes" to question 1 or a "we can't say" on question 2 is a red flag.

Common transcription headaches and how to fix them

Problem: The transcript gets people's names wrong throughout. Fix: Whisper is good at common Western names but struggles with non-English names or culturally specific spellings. Run a quick find-and-replace once the transcription completes.
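If the same names come up on every call, the find-and-replace can be kept as a reusable correction table. This is a minimal sketch: the misspellings shown are made-up examples, and `fix_names` is a hypothetical helper that matches whole words only so it won't rewrite the inside of longer words.

```python
import re


def fix_names(transcript: str, corrections: dict[str, str]) -> str:
    """Replace each known misspelling with the correct name (whole words only)."""
    for wrong, right in corrections.items():
        pattern = rf"\b{re.escape(wrong)}\b"
        # A function replacement avoids backslash escapes being interpreted.
        transcript = re.sub(pattern, lambda _match, r=right: r, transcript)
    return transcript


# Hypothetical misspellings you have seen in earlier transcripts.
corrections = {"Neeve": "Niamh", "Shiv on": "Siobhan"}
text = "Shiv on and Niamh agreed. Neeve will follow up."
print(fix_names(text, corrections))
# Siobhan and Niamh agreed. Niamh will follow up.
```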

Problem: The transcript runs together with no paragraph breaks. Fix: Most transcription tools (including speechtotext.au) produce raw text with timestamp segments. If you want clean paragraphs, paste into an AI tool (Claude, ChatGPT) and ask it to "format this transcript into readable paragraphs with speaker labels based on context".

Problem: Audio quality is bad and the transcript reflects it. Fix: There's not much to do at the transcription stage — garbage in, garbage out. Future recordings: use good headsets, record in a quiet room, sit close to your microphone, enable noise suppression in Zoom.

Problem: I have 50 hours of old Zoom recordings I never transcribed. Fix: Upload them in batches. The Pro tier ($19/month) at 600 minutes lets you process 10 hours per month, so a 50-hour backlog takes five months. The Business tier ($49/month, 4,000 minutes) covers about 66 hours, enough to clear the whole backlog in a single month.
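To size this for your own backlog, the arithmetic is minutes of audio divided by the monthly allowance, rounded up. The plan figures below are the ones quoted above; `months_to_clear` is an illustrative helper.

```python
import math

# (price per month, included minutes per month), as quoted in the article.
PLANS = {"Pro": (19, 600), "Business": (49, 4000)}


def months_to_clear(backlog_hours: float, minutes_per_month: int) -> int:
    """Whole months needed to transcribe a backlog on a given allowance."""
    return math.ceil(backlog_hours * 60 / minutes_per_month)


for name, (price, minutes) in PLANS.items():
    months = months_to_clear(50, minutes)
    print(f"{name}: {months} month(s), ${price * months} total")
# Pro: 5 month(s), $95 total
# Business: 1 month(s), $49 total
```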

A practical workflow template

For a knowledge worker doing 2–5 Zoom calls per day:

Pre-call (1 minute): Confirm cloud recording is on with separate-speaker option if you need labels.
Post-call (2 minutes): Wait for Zoom to process, download the audio_only.m4a file.
Transcription (3 minutes total, 30 seconds of attention): Drop the file at [speechtotext.au](/), get the transcript back.
Action capture (5 minutes): Scan the transcript, copy out action items into your task system, save the transcript to your call-notes folder.

Total active time: ~8.5 minutes of attention per call. Total elapsed time including transcription: ~11 minutes, not counting any wait for cloud-recording processing. Compared to "I'll re-watch the recording later", this is roughly a 5–7× speedup for a one-hour call.

FAQs

Is it legal to transcribe a Zoom call in Australia?

If you're a participant in the call, generally yes. The rules vary by state and territory (for example, the Surveillance Devices Acts in NSW and Victoria, and the Invasion of Privacy Act in Queensland): some permit recording a call you're a party to, while others require the other participants' consent. Always disclose recording to participants at the start of the call. It's required in some states and best practice everywhere.

Will the transcription be as accurate as Zoom's built-in?

Whisper-large-v3 (used by speechtotext.au) is more accurate: in our testing it made roughly half as many word errors as Zoom's built-in transcription on Australian accents (6.4% vs 12–14% word error rate). The gap is largest on broad regional accents and technical vocabulary.

Can I transcribe a recording without a paid Zoom plan?

Yes. Local recording is included in the free Zoom tier. You record locally, download the M4A, and transcribe with any external tool. No paid Zoom plan required.

How long does transcription take for a 60-minute call?

On speechtotext.au, about 90 seconds from upload to transcript. Comparable tools take a similar amount of time.

Does the transcript include timestamps?

Yes — every segment is timestamped down to the millisecond. The full transcript shows them inline; SRT export uses them for subtitles; you can copy out just the text without timestamps if you want clean prose.
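For reference, SRT stores each subtitle as a sequence number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, and the text. The sketch below shows how timestamped segments map onto that layout. The `(start, end, text)` tuple shape is an assumption for illustration, not the site's internal format.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 01:02:03,500."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) segments as SRT subtitle blocks."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{time_range}\n{text}\n")
    return "\n".join(blocks)


print(srt_timestamp(3723.5))  # 01:02:03,500
print(to_srt([(0.0, 2.4, "Thanks for joining."), (2.4, 5.1, "Let's start.")]))
```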

What's the maximum file size?

speechtotext.au accepts files up to 250 MB. That's roughly 8 hours of 64 kbps mono audio or 4 hours of 128 kbps stereo. For comparison: Otter caps at 200 MB, Descript at 1 GB.

Can I batch-transcribe a folder of recordings?

Via the web UI, one at a time. We have an API in private beta for bulk processing — email us if you've got a backlog to clear and want an early access slot.

---

Try a single file free at [speechtotext.au](/) — no sign-up required. For ongoing use, Pro at $19/month covers ~10 hours of monthly call audio.