
    Speech to Text

    Swiftly convert audio to text for natural responsiveness.

    Cognitive Services Speech to Text offers a range of capabilities you can embed into your apps to support various transcription scenarios, including conversation transcription, speech transcription, and custom speech transcription.

    Conversation transcription

    Enable in-person meeting transcription. Conversation transcription captures speech in real time so that all meeting participants can fully engage in the discussion, identify who said what and when, and quickly follow up on next steps.

    Use conversation transcription to:

  • Capture speech from all around the meeting room.
  • Help safeguard data with industry-leading security and compliance certifications.
  • Support meeting and conference setups that use microphones and video cameras, through pairing with the Speech Devices SDK.

    Speech transcription

    Convert spoken audio to text. Call the API to recognize audio coming from the microphone, from other real-time streaming audio sources, or from a recorded audio file. As audio is sent to the server, partial recognition results are returned if requested.
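
    As a minimal sketch of the recorded-file case, the Python below builds (but does not send) a request against the short-audio REST endpoint. The endpoint path, query parameters, header names, and response fields follow the publicly documented REST API as best recalled here, and should be checked against the current reference; the region and key values are placeholders. This REST endpoint returns only a final result, so partial results would need a streaming client such as the Speech SDK.

```python
import urllib.parse
import urllib.request


def build_recognition_request(region, subscription_key, wav_bytes, language="en-US"):
    """Build (but do not send) a POST request that submits WAV audio for recognition.

    Assumptions: the endpoint path and header names below match the short-audio
    REST API; `region` and `subscription_key` are placeholders for your resource.
    """
    query = urllib.parse.urlencode({"language": language, "format": "simple"})
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        f"/speech/recognition/conversation/cognitiveservices/v1?{query}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        # 16 kHz PCM WAV is assumed here; adjust to match your audio.
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=wav_bytes, headers=headers, method="POST")


def display_text(response_json):
    """Return the recognized text from a parsed JSON response, or None if nothing was recognized."""
    if response_json.get("RecognitionStatus") == "Success":
        return response_json.get("DisplayText")
    return None
```

    To send the request, pass it to `urllib.request.urlopen`, parse the body with `json.loads`, and hand the result to `display_text`.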

    You can use the API to build voice-triggered smart apps.


    Custom speech service: Speech Transcription with Custom Model

    Overcome speech recognition barriers such as speaking style, vocabulary, and background noise. Our speech recognition technologies combine multiple APIs to produce the text output. Customers can customize the APIs to their needs and available data.

    Custom Speech

    Create custom language models tailored to users’ speaking styles

    Don’t let varied vocabularies and speaking styles block understanding. Customize the language model of your app’s speech recognition by tailoring it to your industry’s expressions, to technical, geographic, or market terms, and even to speaking style.

    Adapt to user environment with custom acoustic models

    Make sure your app’s speech recognition can function in all environments. With custom acoustic models, you can account for background noise and match your users’ expected environments.

    Use robust speech models from Microsoft

    Enable powerful, personalized speech recognition by building your own customized speech recognition models on top of Microsoft’s existing state-of-the-art models.


    Explore a speech scenario

    Call center


    Speech Services


    With Speech Services, it's easy to transcribe every call. Index the transcription for full-text search, or apply Text Analytics to detect sentiment, language, and key phrases for insights. If your call center recordings involve specialized terminology, such as product names or IT jargon, create a custom language model to teach Speech Services the vocabulary. A custom acoustic model helps Speech Services understand speakers even with background noise or poor phone connections.

    For more information, read how batch transcription works with Speech Services.


    1. Adapt a model for your domain and deploy that model
    2. Upload your recordings to a blob container
    3. Create a POST request to batch transcription
    4. Speech Services schedules the transcription job
    5. Stereo files are split into two channels
    6. Mono files undergo diarization to distinguish between speakers
    7. Download the transcription using the transcription ID
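
    Step 3 of the flow above can be sketched as follows. This is a hedged illustration, not the service's definitive contract: the endpoint path and the body fields (`contentUrls`, `locale`, `displayName`, `properties.diarizationEnabled`, `model.self`) are assumed from the v3.0 batch transcription REST API and may differ in other versions, and the region, key, and blob URLs are placeholders. The request is built but not sent.

```python
import json
import urllib.request


def build_batch_transcription_request(region, subscription_key, recording_urls,
                                      locale="en-US", model_url=None, diarize=False):
    """Build (but do not send) the POST request that creates a batch transcription job.

    Assumptions: v3.0 endpoint path and field names; `recording_urls` are blob
    URLs the service can read (for example, URLs carrying a SAS token).
    """
    body = {
        "displayName": "Call center batch",      # hypothetical job name
        "locale": locale,
        "contentUrls": list(recording_urls),      # recordings uploaded in step 2
        "properties": {
            # Diarization applies to mono recordings (step 6).
            "diarizationEnabled": diarize,
        },
    }
    if model_url:
        # Reference to the adapted model deployed in step 1.
        body["model"] = {"self": model_url}

    url = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions"
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )
```

    Under the same assumptions, the response to this request identifies the new transcription, and that identifier is what step 7 uses to poll for and download the results.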

    Explore the Cognitive Services APIs

    Computer Vision

    Distill actionable information from images


    Face

    Detect, identify, analyze, organize, and tag faces in photos

    Ink Recognizer PREVIEW

    An AI service that recognizes digital ink content, such as handwriting, shapes, and ink document layout

    Video Indexer

    Unlock video insights

    Custom Vision

    Easily customize your own state-of-the-art computer vision models for your unique use case

    Form Recognizer PREVIEW

    The AI-powered document extraction service that understands your forms

    Text Analytics

    Easily evaluate sentiment and topics to understand what users want

    Translator Text

    Easily conduct machine translation with a simple REST API call

    QnA Maker

    Distill information into conversational, easy-to-navigate answers

    Language Understanding

    Teach your apps to understand commands from your users

    Immersive Reader PREVIEW

    Empower users of all ages and abilities to read and comprehend text

    Speech Services

    Unified speech services for speech-to-text, text-to-speech and speech translation

    Speaker Recognition PREVIEW

    Use speech to identify and verify individual speakers

    Content Moderator

    Automated image, text, and video moderation

    Anomaly Detector PREVIEW

    Easily add anomaly detection capabilities to your apps.

    Personalizer PREVIEW

    An AI service that delivers a personalized user experience

    Use the Speech Devices SDK to build an ambient device and create a custom wake word
