Remove all HTML tags from a string

StarWarsFan
Enthusiast
Posts: 169
Joined: Sat Mar 14, 2015 11:53 am

Remove all HTML tags from a string

Post by StarWarsFan »

A search for "remove html" did not bring up anything useful, so I am asking here:

I have a string and want to remove everything that is HTML from it, be it a simple <p> or a long <a href>...link...</a>

Before I sit down and reinvent the wheel, let me ask: has anybody already programmed that?
There is usually a lot of "try this, maybe do that" but ONLY an example that one can test for themselves and get an immediate result actually brings people forward.
NicTheQuick
Addict
Posts: 1224
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken

Re: Remove all HTML tags from a string

Post by NicTheQuick »

Are we talking about plain HTML or XHTML? For XHTML you can use the XML library of PureBasic to parse it first and then extract only the text you need.
There are also some regular expressions out there which do the right thing in most cases; you can use the RegularExpression library of PureBasic to remove all the tags they find.
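
As a rough illustration of the regular-expression route, here is a minimal sketch. The pattern "<[^>]*>" and the sample string are assumptions made for this example; it treats anything between "<" and ">" as a tag and is not a real HTML parser (comments, scripts and a ">" inside an attribute value will trip it up).

```purebasic
; Minimal sketch: remove everything that looks like <...> with one regex.
; The pattern is an assumption for illustration, not a complete HTML grammar.
Define.s html, text

html = "<p>Hello <a href=" + #DQUOTE$ + "x.html" + #DQUOTE$ + ">world</a></p>"

If CreateRegularExpression(0, "<[^>]*>")
  ; Replace every match (every tag-like span) with an empty string
  text = ReplaceRegularExpression(0, html, "")
  FreeRegularExpression(0)
  Debug text
Else
  ; The pattern failed to compile; show why
  Debug RegularExpressionError()
EndIf
```

For a single, well-formed string this is usually enough; entities such as &nbsp; still have to be handled separately.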
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.
Marc56us
Addict
Posts: 1477
Joined: Sat Feb 08, 2014 3:26 pm

Re: Remove all HTML tags from a string

Post by Marc56us »

StarWarsFan wrote: Wed May 18, 2022 11:41 am
A search for "remove html" did not bring up anything useful, so I ask this here:
I got a string and want to remove everything that is HTML from it, be it a simple <p> or a long <a href>...link...</a>
Before I sit down and reinvent the wheel let me ask: Has anybody before me possibly already programmed that?

Most of the existing tools try (with moderate success) to filter a whole page, and therefore go to a lot of trouble to filter out scripts and other single tags.
If you have only one string, it is usually quite easy. Post a small example if needed; we can build you a custom regex if you don't know how.
:wink:
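
To show why whole pages are harder than a single string, here is a hedged two-pass sketch: first drop <script> and <style> blocks together with their content, then remove the remaining tags. Both patterns and the sample page are assumptions for illustration only; real pages contain cases (comments, CDATA, malformed markup) that they do not cover.

```purebasic
; Sketch of a two-pass filter for a whole page (patterns are illustrative assumptions).
Define.s page, text

page = "<style>p {color: red}</style><p>Visible</p><script>alert('<b>hi</b>');</script>"

; Pass 1: remove <script>...</script> and <style>...</style> including their content.
; DotAll lets "." match line breaks, NoCase also accepts <SCRIPT>.
If CreateRegularExpression(0, "<(script|style)\b[^>]*>.*?</\1\s*>", #PB_RegularExpression_DotAll | #PB_RegularExpression_NoCase)
  page = ReplaceRegularExpression(0, page, "")
  FreeRegularExpression(0)
EndIf

; Pass 2: remove whatever tags are left
If CreateRegularExpression(1, "<[^>]*>")
  text = ReplaceRegularExpression(1, page, "")
  FreeRegularExpression(1)
EndIf

Debug text
```

Without the first pass, the script source itself would end up in the output text.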
AZJIO
Addict
Posts: 1319
Joined: Sun May 14, 2017 1:48 am

Re: Remove all HTML tags from a string

Post by AZJIO »

You still need to process the HTML character entities

Code: Select all

&nbsp;
&quot;
&amp;
&lt;
&gt;
&iexcl;
&cent;
&pound;
&copy;
etc...

and the numeric character codes

Code: Select all

&#(\d+);
For a robust result you need to connect to the browser engine object and request the page text from it. In AutoIt3 this is done through an object (_IEBodyReadText), or by reading the "innertext" property (_IEPropertyGet).
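
Without a browser engine, the entities listed above can be decoded by hand. The following is only a sketch under assumptions: the sample string and the variable names are made up, just a handful of named entities are covered, and Chr() is assumed to run in a Unicode build with code points up to $FFFF.

```purebasic
; Sketch: decode numeric references (&#233;) and a few named entities.
Define.s src, result
Define.i copyFrom = 1, pos

src = "Caf&#233; &amp; bar&nbsp;&#8211; &lt;open&gt;"

; Numeric references first: copy the text between matches and replace each
; match with the character that has the captured decimal code.
If CreateRegularExpression(0, "&#(\d+);")
  If ExamineRegularExpression(0, src)
    While NextRegularExpressionMatch(0)
      pos = RegularExpressionMatchPosition(0)
      If pos > copyFrom
        result + Mid(src, copyFrom, pos - copyFrom)
      EndIf
      result + Chr(Val(RegularExpressionGroup(0, 1)))
      copyFrom = pos + RegularExpressionMatchLength(0)
    Wend
  EndIf
  result + Mid(src, copyFrom)   ; remainder after the last match
  FreeRegularExpression(0)
Else
  result = src
EndIf

; Named entities afterwards; &amp; goes last so that "&amp;lt;" stays "&lt;"
result = ReplaceString(result, "&nbsp;", " ")
result = ReplaceString(result, "&lt;", "<")
result = ReplaceString(result, "&gt;", ">")
result = ReplaceString(result, "&quot;", #DQUOTE$)
result = ReplaceString(result, "&amp;", "&")

Debug result
```

Run this only after the tags have been removed; otherwise a decoded "<" or ">" would be mistaken for a tag delimiter.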
StarWarsFan
Enthusiast
Enthusiast
Posts: 169
Joined: Sat Mar 14, 2015 11:53 am

Re: Remove all HTML tags from a string

Post by StarWarsFan »

Okay, I shall do that myself then; anybody is welcome to join in.

Ideas:
Tactically, I would put the HTML source code into a$.

Simple cases like "&nbsp;" are easy:
a simple a$ = RemoveString(a$, "&nbsp;") should do.

Or you treat "&" as the start marker and ";" as the end marker.

I then search for the next tag, which in HTML must of course start with "<" and end with ">",
using For i = 1 To Len(a$) so that the entire string is processed.

If I find such a tag start ("<"), I search on until I find the tag end (">") and construct two strings.
Let me make an example for a simple <b>test-text</b>:
If I have "<b>", I can simply construct what the end tag must look like; that is easily done with r2$ = ReplaceString(r1$, "<b", "</b")
r1$ would be "<b>"
r2$ would be "</b>"
And then I can call
a$ = RemoveString(a$, r1$)
a$ = RemoveString(a$, r2$)
and continue the loop.

BUT: let us assume longer tags like <a href.....>
There the end tag does not repeat the entire tag; it is simply </a>.

I must somehow distinguish case 1 from case 2 (which has attributes included).

Maybe look for a space character...
spikey
Enthusiast
Posts: 581
Joined: Wed Sep 22, 2010 1:17 pm
Location: United Kingdom

Re: Remove all HTML tags from a string

Post by spikey »

You don't need to get anywhere near that complicated just to remove tags. You just need to traverse the string and set a flag when you enter or leave a tag. If you're in a tag, you omit the character from the output. If you aren't, you include it.

You will want to be more selective about entity codes though, as you may change the semantic content of the string if you just remove them. For example, removing &nbsp;, &ensp; or &emsp; will concatenate the two adjacent words. I'd use ReplaceString first to convert them to a space. Additionally, you need to replace the &lt; and &gt; entities only after you've removed all the tags; otherwise the decoded < and > would be mistaken for tag delimiters.

Code: Select all

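Define.S a, b, c
Define.I l, i, tag = #False

; Sample input with inline tags, a link and two kinds of entities
a = "<p>This is a paragraph.&ensp;It contains some <i>italic</i> text and some <b>bold</b> text.&ensp;" + 
    "It also has a link to the PureBasic <a href=" + #DQUOTE$ + "www.purebasic.com" + #DQUOTE$ + ">website.</a>&ensp;" +
    "PureBasic is &gt; Basic.</p>"

; Convert spacing entities to plain spaces first, so adjacent words stay separated
a = ReplaceString(a, "&ensp;", " ")
l = Len(a)

For i = 1 To l
  
  c = Mid(a, i, 1)
  If c = "<"
    tag = #True     ; entering a tag: stop copying
    
  ElseIf c = ">"
    tag = #False    ; leaving a tag: resume copying
    
  ElseIf Not(tag)
    b + c           ; outside a tag: keep the character
    
  EndIf

Next i

; Decode &gt; only after the tags are gone, so the new ">" is not taken for a tag end
b = ReplaceString(b, "&gt;", ">")

Debug a
Debug b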
infratec
Always Here
Posts: 6818
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

Re: Remove all HTML tags from a string

Post by infratec »

This may be a bit faster (I have not benchmarked it):

Code: Select all

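Define.S a, b
Define *HtmlPtr.Character, *Pos1

a = "<p>This is a paragraph.&ensp;It contains some <i>italic</i> text and some <b>bold</b> text.&ensp;" + 
    "It also has a link to the PureBasic <a href=" + #DQUOTE$ + "www.purebasic.com" + #DQUOTE$ + ">website.</a>&ensp;" +
    "PureBasic is &gt; Basic.</p>"

a = ReplaceString(a, "&ensp;", " ")

*HtmlPtr = @a
*Pos1 = @a    ; start of the current text run; 0 while inside a tag
While *HtmlPtr\c
  
  If *HtmlPtr\c = '<'
    ; A tag begins: append the text run collected so far
    If *Pos1 And *HtmlPtr > *Pos1
      b + PeekS(*Pos1, (*HtmlPtr - *Pos1) / SizeOf(Character))
    EndIf
    *Pos1 = 0
    
  ElseIf *HtmlPtr\c = '>' And *Pos1 = 0
    ; The tag ends: the next text run starts after this character
    *Pos1 = *HtmlPtr + SizeOf(Character)
    
  EndIf
  
  *HtmlPtr + SizeOf(Character)
  
Wend

; Append any text that follows the last tag
If *Pos1
  b + PeekS(*Pos1)
EndIf

b = ReplaceString(b, "&gt;", ">")

Debug a
Debug b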
firace
Addict
Posts: 899
Joined: Wed Nov 09, 2011 8:58 am

Re: Remove all HTML tags from a string

Post by firace »

The code below uses built-in Windows functionality (the Internet Explorer engine behind the WebGadget). There is a very simple demo at the end of the code.

Code: Select all

; Extended WebBrowser Library functions by firace - partly adapted from code by freak

Define.s regkeyName, dwLabel, statusMsg, keyResult.i
Define.l dwValue, dwValueCheck

Define.l lastpressTimestamp = ElapsedMilliseconds()

Declare Async_OnPageChange() 

Procedure.s RegReadString(HKMain, HKSub$, HKEntry$) 
  hKey = 0
  If RegOpenKeyEx_(HKMain, HKSub$, 0, #KEY_QUERY_VALUE, @hKey) = #ERROR_SUCCESS 
    result$ = Space(4096)
    bufLen = Len(result$)
    If hKey 
      If RegQueryValueEx_(hKey, HKEntry$, 0, 0, @result$, @bufLen) <> #ERROR_SUCCESS
        result$ = "Error reading Registry"
      EndIf
      RegCloseKey_(hKey) 
    EndIf 
  Else
    result$ = "Error opening Registry key"
  EndIf 
  ProcedureReturn result$ 
EndProcedure 

ServiceVersionNumber$ = RegReadString(#HKEY_LOCAL_MACHINE,"SOFTWARE\Microsoft\Internet Explorer","svcUpdateVersion")
patchlevel = Val(Right(ServiceVersionNumber$,3))

regkeyName    = "Software\Microsoft\Internet Explorer\Main\FeatureControl\Feature_Browser_Emulation\"
regkeyName3.s = "Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_RESTRICT_ABOUT_PROTOCOL_IE7\"
dwLabel = GetFilePart(ProgramFilename())

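dwValue = 11001  ; IE11 edge mode for the embedded browser control
If RegOpenKeyEx_(#HKEY_CURRENT_USER, regkeyName, 0, #KEY_ALL_ACCESS, @keyResult) = #ERROR_SUCCESS
  RegSetValueEx_(keyResult, @dwLabel, 0, #REG_DWORD, @dwValue, SizeOf(Long))
  RegCloseKey_(keyResult)  ; release the key handle again
EndIf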


UserAgent$ =  "Mozilla/5.0 (Windows NT 6.1; WOW64) like Gecko"  
UrlMkSetSessionOption_( $10000001 ,  Ascii(UserAgent$) ,  Len ( UserAgent$ ) ,  0 )


#OLECMDID_PROPERTIES = 10
#olecmdid_find = 32



DataSection 
  IID_IHTMLElement: ; {3050F1FF-98B5-11CF-BB82-00AA00BDCE0B} 
  Data.l $3050F1FF 
  Data.w $98B5, $11CF 
  Data.b $BB, $82, $00, $AA, $00, $BD, $CE, $0B 
  IID_IHTMLDocument2: ; {332C4425-26CB-11D0-B483-00C04FD90119} 
  Data.l $332C4425 
  Data.w $26CB, $11D0 
  Data.b $B4, $83, $00, $C0, $4F, $D9, $01, $19 
EndDataSection 


CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
  Import ""
    MakeBSTR(str.p-unicode) As "_SysAllocString"
  EndImport
CompilerElse
  Import ""
    MakeBSTR(str.p-unicode) As "SysAllocString"
  EndImport
CompilerEndIf


;; -- begin iDispatch interface functions.  partly adapted from code by freak

#DISPID_NAVIGATEERROR=        271  


Structure DispatchFunctions
  QueryInterface.l
  AddRef.l
  Release.l
  GetTypeInfoCount.l
  GetTypeInfo.l
  GetIDsOfNames.l
  Invoke.l
EndStructure

Structure DispatchObject
  *IDispatch.IDispatch
  ObjectCount.l
EndStructure


Procedure.l AddRef(*THIS.DispatchObject)
  *THIS\ObjectCount + 1
  ProcedureReturn *THIS\ObjectCount
EndProcedure

Procedure.l QueryInterface(*THIS.DispatchObject, *iid.GUID, *Object.LONG)
  
  If CompareMemory(*iid,?IID_DWebBrowserEvents2,16)
    ;         CallDebugger 
  EndIf 
  
  If CompareMemory(*iid, ?IID_IUnknown, SizeOf(GUID)) Or CompareMemory(*iid, ?IID_IDispatch, SizeOf(GUID))
    *Object\l = *THIS
    AddRef(*THIS.DispatchObject)
    ProcedureReturn #S_OK
  Else 
    *Object\l = 0 
    ProcedureReturn #E_NOINTERFACE
  EndIf
EndProcedure

Procedure.l Release(*THIS.DispatchObject)
  *THIS\ObjectCount - 1
  ProcedureReturn *THIS\ObjectCount
EndProcedure

Procedure GetTypeInfoCount(*THIS.DispatchObject, pctinfo)
  ProcedureReturn #E_NOTIMPL
EndProcedure

Procedure GetTypeInfo(*THIS.DispatchObject, iTInfo, lcid, ppTInfo )
  ProcedureReturn #E_NOTIMPL
EndProcedure

Procedure GetIDsOfNames(*THIS.DispatchObject, riid, rgszNames, cNames, lcid, rgDispId) : EndProcedure


Procedure.s StringFromVARIANT(*var.VARIANT)
  
  If VariantChangeType_(*var, *var, $2, #VT_BSTR) = #S_OK
    Result$ = PeekS(*var\bstrVal, -1, #PB_Unicode)
    SysFreeString_(*var\bstrVal)
  Else
    Result$ = "ERROR : Cannot convert VARIANT to String!"
  EndIf
  
  ProcedureReturn Result$
EndProcedure



Global NewList dispatchObject.DispatchObject()


Procedure Invoke(*THIS.DispatchObject, dispIdMember, riid, lcid, wFlags, *pDispParams.DISPPARAMS, pVarResult, pExcepInfo, puArgErr)
  
  Select dispIDMember
  EndSelect 
EndProcedure

AddElement(DispatchObject())
DispatchObject()\IDispatch = ?dispatchFunctions



;/////////////////////////////////////////////////////////////////////////////////
Structure _IDocHostUIHandler
  *vTable
  ref.i
  iDocHostUiHandler.iDocHostUiHandler
EndStructure


Procedure.i SetCustomDocHostUIHandler(id, vTableAddress)
  Protected result=#E_FAIL, hWnd, iBrowser.IWebBrowser2, iDispatch.IDispatch, iDocument.IHTMLDocument2, iOLE.IOleObject, iDocHostUIHandler.IDocHostUIHandler
  Protected iCustomDoc.ICustomDoc, iOLEClientSite.IOleClientSite, *this._IDocHostUIHandler
  hWnd = GadgetID(id)
  If hWnd
    iBrowser = GetWindowLongPtr_(hWnd, #GWL_USERDATA)  ; pointer-sized read, also safe on x64
    If iBrowser
      If iBrowser\get_Document(@iDispatch) = #S_OK
        If iDispatch\QueryInterface(?IID_IHTMLDocument2, @iDocument) = #S_OK
          If iDocument\QueryInterface(?IID_IOleObject, @iOLE) = #S_OK
            If iOLE\GetClientSite(@iOLEClientSite) = #S_OK
              If iOLEClientSite\QueryInterface(?IID_IDocHostUIHandler, @iDocHostUIHandler) = #S_OK
                If iDocument\QueryInterface(?IID_ICustomDoc, @iCustomDoc) = #S_OK
                  *this = AllocateMemory(SizeOf(_IDocHostUIHandler))
                  If *this
                    *this\vTable = vTableAddress
                    *this\iDocHostUiHandler = iDocHostUIHandler
                    iCustomDoc\SetUIHandler(*this)
                    result = #S_OK
                  Else
                    iDocHostUIHandler\Release()
                  EndIf           
                  iCustomDoc\Release()
                Else
                  iDocHostUIHandler\Release()
                EndIf
              EndIf
              IOleClientSite\Release()
            EndIf
            iOLE\Release()
          EndIf
          iDocument\Release()
        EndIf
        iDispatch\Release()
      EndIf
    EndIf
  EndIf
  ProcedureReturn result
EndProcedure


;/////////////////////////////////////////////////////////////////////////////////
;iUnknown.
Procedure.i IDocHostUIHandler_QueryInterface(*this._IDocHostUIHandler, riid, *ppObj.INTEGER)
  Protected hResult = #E_NOINTERFACE, iunk.iUnknown
  If *ppObj And riid
    *ppObj\i = 0
    If CompareMemory(riid, ?IID_IUnknown, SizeOf(IID)) Or CompareMemory(riid, ?IID_IDocHostUIHandler, SizeOf(IID))
      *ppObj\i = *this
      *this\ref+1
      hResult = #S_OK
    EndIf
  EndIf
  ProcedureReturn hResult
EndProcedure


;iUnknown.
Procedure.i IDocHostUIHandler_AddRef(*this._IDocHostUIHandler)
  *this\ref = *this\ref + 1
  ProcedureReturn *this\ref
EndProcedure


;iUnknown.
Procedure.i IDocHostUIHandler_Release(*this._IDocHostUIHandler)
  Protected refCount
  *this\ref = *this\ref - 1
  refCount = *this\ref
  If *this\ref = 0
    *this\iDocHostUiHandler\Release()
    FreeMemory(*this)
  EndIf
  ProcedureReturn refCount
EndProcedure

Procedure.i IDocHostUIHandler_ShowUI(*this._IDocHostUIHandler, dwID, pActiveObject, pCommandTarget, pFrame, pDoc)
  ProcedureReturn *this\iDocHostUiHandler\ShowUI(dwID, pActiveObject, pCommandTarget, pFrame, pDoc)
EndProcedure


Procedure.i IDocHostUIHandler_HideUI(*this._IDocHostUIHandler)
  ProcedureReturn *this\iDocHostUiHandler\HideUI()
EndProcedure


Procedure.i IDocHostUIHandler_FilterDataObject(*this._IDocHostUIHandler, pDO, ppDORet)
  ProcedureReturn *this\iDocHostUiHandler\FilterDataObject(pDO, ppDORet)
EndProcedure

DataSection
  
  IID_IOleObject: ; 00000112-0000-0000-C000-000000000046
  Data.l $00000112
  Data.w $0000, $0000
  Data.b $C0, $00, $00, $00, $00, $00, $00, $46
  
  IID_IDocHostUIHandler: ; BD3F23C0-D43E-11CF-893B-00AA00BDCE1A
  Data.l $BD3F23C0
  Data.w $D43E, $11CF
  Data.b $89, $3B, $00, $AA, $00, $BD, $CE, $1A
  
  IID_ICustomDoc: ; 3050F3F0-98B5-11CF-BB82-00AA00BDCE0B
  Data.l $3050F3F0
  Data.w $98B5, $11CF
  Data.b $BB, $82, $00, $AA, $00, $BD, $CE, $0B
  
  IID_IHTMLDocument: ; {626FC520-A41E-11CF-A731-00A0C9082637}
  Data.l $626FC520
  Data.w $A41E, $11CF
  Data.b $A7, $31, $00, $A0, $C9, $08, $26, $37
  
  IID_NULL: ; {00000000-0000-0000-0000-000000000000}
  Data.l $00000000
  Data.w $0000, $0000
  Data.b $00, $00, $00, $00, $00, $00, $00, $00       
  
EndDataSection

;; -- end iDispatch interface functions.  partly adapted from code by freak


Procedure.i WebHelpers_GetHTMLDocument2 (nGadget)
  Protected oBrowser.IWebBrowser2 = GetWindowLongPtr_(GadgetID(nGadget), #GWL_USERDATA)
  Protected oDocumentDispatch.IDispatch
  Protected oHTMLDocument.IHTMLDocument2
  Protected iBusy
  
  If oBrowser
    ; Wait until the browser has finished loading before touching the document
    Repeat
      While WindowEvent(): Delay(0): Wend    
      oBrowser\get_Busy(@iBusy): Delay(10)        
    Until iBusy = #VARIANT_FALSE
    
    If oBrowser\get_document(@oDocumentDispatch) = #S_OK And oDocumentDispatch
      oDocumentDispatch\QueryInterface(?IID_IHTMLDocument2, @oHTMLDocument)
      oDocumentDispatch\Release()  ; release the dispatch whether or not the query succeeded
    EndIf
  EndIf
  
  ProcedureReturn oHTMLDocument 
EndProcedure

Procedure.i WebHelpers_GetHTMLDocumentParent (nGadget)
  Protected oHTMLDocument.IHTMLDocument2 = WebHelpers_GetHTMLDocument2 (nGadget)
  Protected oWindow.IHTMLWindow2
  
  If oHTMLDocument
    oHTMLDocument\get_parentWindow(@oWindow)
    oHTMLDocument\Release()  ; only release when a document was actually returned
  EndIf
  
  
  ProcedureReturn oWindow
EndProcedure 


Procedure WebHelpers_InvokeJS (nGadget, sScriptCode.s, sScriptLanguage.s = "JavaScript")    
  Protected oWindow.IHTMLWindow2 = WebHelpers_GetHTMLDocumentParent (nGadget)
  Protected tVariant.VARIANT
  
  If oWindow
    oWindow\execScript ("0"        , sScriptLanguage, @tVariant)
    oWindow\execScript (sScriptCode, sScriptLanguage, @tVariant)
    oWindow\Release()
  EndIf    
EndProcedure 


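Procedure.s WebHelpers_GetURL(WGN.l) 
  Protected WebObject.IWebBrowser2, *bstr, url$
  WebObject = GetWindowLongPtr_(GadgetID(WGN), #GWL_USERDATA) 
  If WebObject
    If WebObject\get_LocationURL(@*bstr) = #S_OK
      If *bstr
        url$ = PeekS(*bstr, -1, #PB_Unicode)
        SysFreeString_(*bstr)  ; the returned BSTR is owned by the caller
      EndIf
    EndIf
  EndIf
  ProcedureReturn url$
EndProcedure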


Procedure.s WebHelpers_GetJSV(Gadget, Name$)
  Result$ = "ERROR" 
  
  Browser.IWebBrowser2 = GetWindowLongPtr_(GadgetID(Gadget), #GWL_USERDATA)
  If Browser\get_Document(@DocumentDispatch.IDispatch) = #S_OK
    If DocumentDispatch\QueryInterface(?IID_IHTMLDocument, @Document.IHTMLDocument) = #S_OK
      If Document\get_Script(@Script.IDispatch) = #S_OK
        
        bstr_name = MakeBSTR(Name$)
        result = Script\GetIDsOfNames(?IID_NULL, @bstr_name, 1, 0, @dispID.l)
        If result = #S_OK
          
          params.DISPPARAMS\cArgs = 0
          params\cNamedArgs = 0        
          
          result = Script\Invoke(dispID, ?IID_NULL, 0, #DISPATCH_PROPERTYGET, @params, @varResult.VARIANT, 0, 0)
          If result = #S_OK
            Result$ = StringFromVARIANT(@varResult)
          Else
            Message$ = Space(3000)
            FormatMessage_(#FORMAT_MESSAGE_IGNORE_INSERTS|#FORMAT_MESSAGE_FROM_SYSTEM, 0, result, 0, @Message$, 3000, 0)          
            Result$ = "ERROR: Invoke() "+Message$            
          EndIf
          
        Else
          Message$ = Space(3000)
          FormatMessage_(#FORMAT_MESSAGE_IGNORE_INSERTS|#FORMAT_MESSAGE_FROM_SYSTEM, 0, result, 0, @Message$, 3000, 0)          
          Result$ = "ERROR: GetIDsOfNames() "+Message$          
          
        EndIf
        SysFreeString_(bstr_name)
        
        Script\Release()
      EndIf
      Document\Release()
    EndIf
    DocumentDispatch\Release()
  EndIf
  
  ProcedureReturn Result$
EndProcedure


Procedure.s WebHelpers_GetInnerText(webgadget)   ;;;  JavaScript code compatible with IE11 and CEF
  
  script2.s + "var jResult=document.documentElement.innerText;     " 
  script2.s + "if (window.frames.length) { "
  script2.s + "  for (var xx = 0 ; xx < window.frames.length ; xx++) "
  script2.s + "    {jResult = jResult + '\n\n' + window.frames[xx].document.documentElement.innerText;}" 
  script2.s + "} "
  script2.s + "jResult; "
  WebHelpers_InvokeJS (webgadget, script2.s) 
  
  ProcedureReturn WebHelpers_GetJSV(webgadget,"jResult")
EndProcedure




DataSection
  
  dispatchFunctions:
  Data.i @QueryInterface(),@AddRef(),@Release(),@GetTypeInfoCount()
  Data.i @GetTypeInfo(),@GetIDsOfNames(),@Invoke()
  
  IID_IWebBrowser2:
  Data.l $D30C1661
  Data.w $CDAF, $11D0
  Data.b $8A, $3E, $00, $C0, $4F, $C9, $E2, $6E
  
  IID_IConnectionPointContainer:
  Data.l $B196B284
  Data.w $BAB4, $101A
  Data.b $B6, $9C, $00, $AA, $00, $34, $1D, $07   
  
  IID_IDispatch:
  Data.l $00020400
  Data.w $0000, $0000
  Data.b $C0, $00, $00, $00, $00, $00, $00, $46
  
  IID_IUnknown:
  Data.l $00000000
  Data.w $0000, $0000
  Data.b $C0, $00, $00, $00, $00, $00, $00, $46
  
  IID_DWebBrowserEvents2:
  Data.l $34A715A0
  Data.w $6587, $11D0
  Data.b $92, $4A, $00, $20, $AF, $C7, $AC, $4D
EndDataSection

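Procedure.s RemoveTagsFromString(texttoprocess$)
  Protected result$
  
  ; Render the HTML in a hidden WebGadget and let the browser engine return the visible text
  If OpenWindow(99, 100, 100, 140, 80, "", #PB_Window_Invisible)
    WebGadget(0, 0, 0, 100, 100, "")
    SetGadgetItemText(0, #PB_Web_HtmlCode, texttoprocess$)
    result$ = WebHelpers_GetInnerText(0)
    CloseWindow(99)  ; do not leave a hidden window behind on every call
  EndIf
  
  ProcedureReturn result$
EndProcedure   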



;;;  demo  ;;;

Debug RemoveTagsFromString("<B>Hello </B> <a href=https://www.google.com/>Click here</a>")
