Delphi Programming with TWebBrowser


How to get the HTML displayed in a TWebBrowser


There are three ways to get the HTML displayed in a web browser:

  1. Obtain the HTML from the WebBrowser DOM
  2. Obtain the HTML from the WebBrowser
  3. Obtain the HTML from the browser cache

Each approach has advantages and disadvantages.


Obtain the HTML from the WebBrowser DOM

To retrieve the HTML directly from the WebBrowser's DOM:

// Requires SHDocVw (TWebBrowser) and MSHTML (IHTMLDocument2) in the uses clause.
function GetHtml(const webBrowser: TWebBrowser): String;
var
  document: IHTMLDocument2;
begin
  document := webBrowser.Document as IHTMLDocument2;
  result := document.body.innerHTML;
end;

This is simple and works well. Its main drawback is that it returns the HTML as the web browser currently holds it in the DOM, which is not necessarily the same as the original HTML. For example, if the original HTML file included:

<script type="text/javascript">
document.write('Hello');
</script>

then the HTML returned by the above function will contain the text "Hello" but not the "<script ..." block that produced it. Because only the body is read, the result also omits any header information (such as the keywords and the title).
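
If the header information is needed from the DOM as well, one possibility is to read the outerHTML of the document's root element rather than the innerHTML of the body. The following is a minimal sketch only: GetDocumentHtml is a hypothetical name, and it assumes that the MSHTML unit in your Delphi version declares IHTMLDocument3.

```delphi
uses
  SysUtils, SHDocVw, MSHTML;

// Hypothetical helper (not part of the original notes): returns the
// markup of the whole document as the DOM currently holds it,
// including the <head> element.
function GetDocumentHtml(webBrowser: TWebBrowser): String;
var
  document: IHTMLDocument3;
begin
  Result := '';
  // Document is nil until a page has been loaded. Supports returns
  // False for nil, and also when the content is not an HTML document.
  if Supports(webBrowser.Document, IHTMLDocument3, document) and
     Assigned(document.documentElement) then
    Result := document.documentElement.outerHTML;
end;
```

Like the function above, this returns the DOM as the browser currently holds it, so it still reflects any changes made by scripts rather than the original file.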


Obtain the HTML from the WebBrowser

The following function will extract the HTML from a WebBrowser, including the header block as well as the body of the HTML:

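// Requires SHDocVw (TWebBrowser), Classes (TStringStream, TStreamAdapter)
// and ActiveX (IStream, IPersistStreamInit) in the uses clause.
function GetBrowserHtml(const webBrowser: TWebBrowser): String;
var
  strStream: TStringStream;
  adapter: IStream;
  browserStream: IPersistStreamInit;
begin
  strStream := TStringStream.Create('');
  try
    browserStream := webBrowser.Document as IPersistStreamInit;
    // soReference: the adapter does not own strStream, so it is freed below.
    adapter := TStreamAdapter.Create(strStream,soReference);
    browserStream.Save(adapter,true);
    result := strStream.DataString;
  finally
    strStream.Free();
  end;
end;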
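
GetBrowserHtml relies on webBrowser.Document being assigned, which is not the case until a page has been loaded. The following sketch shows one way to guard the call. TryGetBrowserHtml is a hypothetical name, and the sketch assumes that the GetBrowserHtml function above is in scope.

```delphi
uses
  SHDocVw;

// Hypothetical wrapper (not part of the original notes). Returns False
// and an empty string if the browser has not finished loading a document.
function TryGetBrowserHtml(webBrowser: TWebBrowser; out html: String): Boolean;
begin
  html := '';
  // READYSTATE_COMPLETE is declared in SHDocVw.
  Result := (webBrowser.ReadyState = READYSTATE_COMPLETE) and
    Assigned(webBrowser.Document);
  if Result then
    html := GetBrowserHtml(webBrowser);
end;
```

A caller might use it as "if TryGetBrowserHtml(WebBrowser1, html) then Memo1.Text := html;", where WebBrowser1 and Memo1 are assumed to be components on the form.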

Obtain the HTML from the browser cache

The following example shows how to retrieve the HTML from the browser cache:

var
  h_cachedInternet: HINTERNET;

// Requires WinInet, Windows, Forms and SHDocVw in the uses clause.
function GetRawHtml(const web_browser: TWebBrowser): String;
var
  http_handle: HINTERNET;
  buffer: array [0..4095] of AnsiChar;
  chunk: AnsiString;
  url: String;
  bytes_read: DWORD;
begin
  result := '';
  url := web_browser.LocationURL;
  http_handle := InternetOpenUrl(h_cachedInternet,
    PChar(url),nil,0,INTERNET_FLAG_NO_UI,0);
  if http_handle = nil then
    Exit;
  try
    //--------------------------------------------------------------
    // Retrieve the URL data. Hopefully this should be straight from
    // the cache because of how the internet connection was defined.
    //--------------------------------------------------------------
    repeat
      bytes_read := 0;
      // Stop if the read fails, otherwise the loop might never end.
      if not InternetReadFile(http_handle,@buffer,SizeOf(buffer),bytes_read) then
        Break;
      // Copy exactly the bytes read, including any embedded #0 bytes.
      SetString(chunk,PAnsiChar(@buffer),bytes_read);
      result := result + String(chunk);
    until bytes_read = 0;
  finally
    InternetCloseHandle(http_handle);
  end;
end;

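initialization

//--------------------
// Initialise WinInet.
//--------------------
h_cachedInternet := InternetOpen(PChar(application.title),
  INTERNET_OPEN_TYPE_PRECONFIG_WITH_NO_AUTOPROXY,nil,nil,
  INTERNET_FLAG_FROM_CACHE);

finalization

// Release the WinInet session handle.
if h_cachedInternet <> nil then
  InternetCloseHandle(h_cachedInternet);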

This has the advantage that, given the URL, it does not require an instance of TWebBrowser, so it will be better suited to some applications.

Note:

  • It uses WinInet functions and only uses the browser to obtain the URL.
  • It reads the file directly from the WinInet file cache, so it assumes that the file in the cache is the same as the one used by the TWebBrowser. That assumption is reasonable most of the time, but the file may have been flushed from the cache, never cached, or replaced with a different copy by another web browser.
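
Because the cached file may be missing, one option is to fall back to the browser's own copy when the cache yields nothing. The following is a minimal sketch: GetHtmlPreferCache is a hypothetical name, and it assumes that the GetRawHtml and GetBrowserHtml functions shown earlier are in scope.

```delphi
uses
  SHDocVw;

// Hypothetical helper (not part of the original notes). Relies on the
// GetRawHtml and GetBrowserHtml functions from these notes.
function GetHtmlPreferCache(webBrowser: TWebBrowser): String;
begin
  // Try the WinInet cache first. GetRawHtml returns '' when the URL
  // cannot be opened from the cache.
  Result := GetRawHtml(webBrowser);
  // Fall back to the HTML held by the browser itself.
  if Result = '' then
    Result := GetBrowserHtml(webBrowser);
end;
```

This only covers the case where nothing could be read from the cache. It cannot detect a cached copy that differs from the one the TWebBrowser displayed, and an empty result may also simply mean an empty document.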

See also: How to navigate a frameset.


These notes are believed to be correct for Delphi 6, but may apply to other versions as well.


