Extract links from an HTML file

Post by bidanh00co »

I have an HTML file that contains many website links. Copying and pasting each link by hand would be a foolish way to do it :P

How can I write a program that extracts all the links from an HTML file? I know a link always begins with "http://www." and ends with a double quote (").

Post by Killswitch »

Code: Select all

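; Assumes the HTML file is already open as file 0, e.g. ReadFile(0, "page.html")
While Eof(0) = 0
  text.s + ReadString(0)   ; current PureBasic versions need the file number
Wend

pos = FindString(text, "http://", 1)
While pos
  quote = FindString(text, Chr(34), pos)   ; closing quote after the link
  If quote = 0 : Break : EndIf             ; no closing quote left: stop
  Debug Mid(text, pos, quote - pos)
  pos = FindString(text, "http://", quote)
Wend
Untested, but that's the general idea.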
~I see one problem with your reasoning: the fact is thats not a chicken~

Post by CONVERT »

Some parts of these procedures may help you. They look for image file names in .html files.

For your case, wfin$ can contain Chr(34).
wpos can contain the position in the string just after "http://www".

Code: Select all

Procedure.s extrait(enr$,wpos,wfin$)
; (messages translated from the original French)
; mess() is a separate message procedure and wfile$ a variable; neither is shown in this post
res$ = ""

wposg = FindString(enr$,wfin$,wpos)
If wposg = 0 Or wposg = wpos   ; wfin$ not found, or nothing in front of it
  wlib$ = wfile$ +  Chr(13)
  wlib$ = wlib$ + " cannot find " + wfin$ + " in extrait()" + Chr(13)
  wlib$ = wlib$ + "in: " + enr$ + Chr(13)
  wlib$ = wlib$ + "FATAL ERROR."
  mess(wlib$)
  End
EndIf

res$ = Mid(enr$,wpos,wposg - wpos)

ProcedureReturn res$
EndProcedure
In the following code, which calls the previous procedure, you can replace "<img src=" + G$ + "_" with "http://www".

You can replace ".jpg" with Chr(34).

(G$ contains Chr(34).)

Code: Select all

Procedure Rech_Image(wdir_current_orig$, wfile_htm$)
; (messages and comments translated from the original French)

wfile$ = wdir_current_orig$ + "\" + wfile_htm$

wres = ReadFile(GN_htm,wfile$)
If wres = 0
  Mess(wfile$ + " NOT OPENED in Rech_Image(). Fatal error.")
  End
EndIf

While Eof(GN_htm) = 0
  enr$ = ReadString(GN_htm)   ; current PureBasic versions need the file number
  wpos = FindString(enr$,"<img src=" + G$ + "_",1)
  If wpos <> 0
    wnom_numero$ = extrait(enr$,wpos + 10,".jpg")    ; ----------------   IMAGES
                                                     ; thumbnails 00001_v.jpg are not handled
                                                     ; general case of contact1.html
    If wnom_numero$ = ""
      wlib$ = wfile$ +  Chr(13)
      wlib$ = wlib$ + " cannot extract the 'numero' name of the image" + Chr(13)
      wlib$ = wlib$ + "in: " + enr$ + Chr(13)
      wlib$ = wlib$ + "FATAL ERROR."
      mess(wlib$)
      End
    Else
      wpos2 = FindString(enr$,"alt=" + G$, wpos + 14)
      If wpos2 <> 0
        wnom_definitif$ = extrait(enr$,wpos2 +5,G$)
        If wnom_definitif$ = ""
          wlib$ = wfile$ +  Chr(13)
          wlib$ = wlib$ + "For '<img src=': " + wnom_numero$ + " starting from " + Str(wpos + 14) + Chr(13)
          wlib$ = wlib$ + " cannot extract the 'alt=' name of the image" + Chr(13)
          wlib$ = wlib$ + "in: " + enr$ + Chr(13)
          wlib$ = wlib$ + "FATAL ERROR."
          mess(wlib$)
          End
        Else
          If GInum > GInum_maxi
            wlib$ = wfile$ +  Chr(13)
            wlib$ = wlib$ + "Number of images greater than " + Str(GInum_maxi) + Chr(13)
            wlib$ = wlib$ + "in: " + enr$ + Chr(13)
            wlib$ = wlib$ + "FATAL ERROR."
            mess(wlib$)
            End
          EndIf
          
          Tnum$(GInum) = wnom_numero$
          Tnom$(GInum) = wnom_definitif$
          GInum = GInum + 1
        EndIf
      EndIf
    EndIf
  EndIf
Wend

CloseFile(GN_htm)

ProcedureReturn
EndProcedure
PureBasic 6.01 LTS 64 bit | Windows 10 Pro x64 | Intel(R) Core(TM) i7-8700 CPU @ 3.20Ghz 16 GB RAM, SSD 500 GB, PC locally assembled

Post by bidanh00co »

Thanks a lot, CONVERT and Killswitch!

I tried CONVERT's code first... it's a little difficult (your comments are not in English :)

When I tested it, the error message said "mess() is not a function, an array, or linked list".

The error comes from the line "mess(wlib$)".

Sorry for my stupid question, I really don't understand your code... :(

- I can follow Killswitch's idea, though. :)
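
The error appears because mess() is not a built-in PureBasic command. It looks like one of CONVERT's own helper procedures that was not included in the post. A minimal stand-in could be declared above the other procedures, assuming the helper only needs to display the text it is given (the real one may do more):

```purebasic
; Hypothetical stand-in for CONVERT's mess() helper, which was not posted.
; Assumption: it only displays the message it receives.
Procedure mess(wlib$)
  MessageRequester("Message", wlib$, 0)
EndProcedure

; Example call with a two-line message
mess("first line" + Chr(13) + "second line")
```

The posted procedures also rely on other names that are not shown (G$, GN_htm, GInum, GInum_maxi and the arrays Tnum$() and Tnom$()), so this stub alone is probably not enough to make them compile.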

Post by dracflamloc »

Um, if you have the HTML source code, just look for "<a href=" and grab everything up to "</a>".

This will give you every link.
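
That suggestion could be sketched roughly as follows. The sample$ string and the names lower$ and anchors() are made up for illustration, and the search runs on a lower-case copy so that "<A HREF=" is matched as well:

```purebasic
; Sketch: collect every <a href=...>...</a> span from a string.
; sample$ is made-up test data, not taken from a real page.
sample$ = "Intro <a href=" + Chr(34) + "http://www.example.com/a.avi" + Chr(34) + ">Clip</a>"
sample$ + " and <A HREF=" + Chr(34) + "page.html" + Chr(34) + ">Page</A> end"

NewList anchors.s()

lower$ = LCase(sample$)              ; search copy, so <A HREF= matches too
p = FindString(lower$, "<a href=", 1)
While p
  e = FindString(lower$, "</a>", p)
  If e = 0 : Break : EndIf           ; unterminated anchor: stop
  AddElement(anchors())
  anchors() = Mid(sample$, p, e - p + 4)   ; +4 keeps the closing </a>
  p = FindString(lower$, "<a href=", e)
Wend

ForEach anchors()
  Debug anchors()
Next
```

Each result still contains the tag and the link text, so the URL itself has to be cut out of it afterwards.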

Post by CONVERT »

Try this code instead; it is closer to what you need:

Code: Select all

Procedure.s extract(enr$,wpos,wfin$)
res$ = ""

wposg = FindString(enr$,wfin$,wpos)
If wposg <> wpos And wposg <> 0
  res$ = Mid(enr$,wpos,wposg - wpos)
EndIf

ProcedureReturn res$
EndProcedure


Procedure Look_url(wdir_current_orig$, wfile_htm$,wno_out)

wfile$ = wdir_current_orig$ + "\" + wfile_htm$

wno_in = ReadFile(#PB_Any,wfile$)
If wno_in = 0
  MessageRequester("Error",wfile$ + " not opened.",0)
  End
EndIf

While Eof(wno_in) = 0
  enr$ = ReadString(wno_in)   ; current PureBasic versions need the file number
  wpos = FindString(enr$,"http://",1)
  If wpos <> 0
    ; text after "http://" up to the next quote (first link on each line only)
    wurl$ = extract(enr$,wpos + 7,Chr(34))
    If wurl$ <> ""
      WriteStringN(wno_out,wurl$)   ; file number passed directly; UseFile() no longer exists
    EndIf
  EndIf
Wend

CloseFile(wno_in)

ProcedureReturn
EndProcedure

;- BEGIN

infile$ = "your file.html"
current_dir$ = "your directory"

wno_out = CreateFile(#PB_Any,"out.txt")
If wno_out = 0
  MessageRequester("Error","out.txt not created.",0)
  End
EndIf

Look_url(current_dir$, infile$,wno_out)

CloseFile(wno_out)

End
The result is in the OUT.TXT file in the current directory.

Post by Dare2 »

Assuming you are looking at an HTML source:

Code: Select all

html.s="Say this is where all your links are embedded."
html+" For example <a href="+Chr(34)+"apage.html"+Chr(34)+">Click text</a>"
html+" and sans quotes <A HREF=http://www.google.com>GOOGLE</A>"
html+" and perhaps <a href="+Chr(34)+"javascript:doWeReallyWantThis(var);"+Chr(34)+">JS Call</a>"

Debug html

; This loop pulls all hyperlinks. Handle result in "found" as needed.

p=1
Repeat
  p=FindString(UCase(html),"<A",p)
  If p
    p=FindString(html,"=",p)+1
    e=FindString(html,">",p)
    found.s=Trim(ReplaceString(Mid(html,p,e-p),Chr(34),""))
    Debug found
    p=e
  EndIf
Until p=0
@}--`--,-- A rose by any other name ..

Post by ricardo »

Dare2 wrote:

Code: Select all

Repeat
  p=FindString(UCase(html),"<A",p)
  If p
    p=FindString(html,"=",p)+1
    e=FindString(html,">",p)
    found.s=Trim(ReplaceString(Mid(html,p,e-p),Chr(34),""))
    Debug found
    p=e
  EndIf
Until p=0
I think this is faster, right?

Code: Select all

Repeat
  p=FindString(UCase(html),"<A HREF=",p)
  If p
    p+9
    e=FindString(html,CHR(34),p+1)
    found.s=Mid(html,p,e-p)
    Debug found
    p=e
  EndIf
Until p=0
ARGENTINA WORLD CHAMPION

Post by Dare2 »

Yes. :) And also avoids <A NAME= situations.

And the p=e can go as well.

(I must have released a beta!)
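
Both loops call UCase(html) on every pass, so the whole string is converted again for each link found. Here is a sketch of the same search with the upper-case copy made only once. The sample html string and the names upper and url are made up for illustration. Unlike the loops above, it searches for the opening quote as well, so href values without double quotes are skipped rather than cut in the wrong place:

```purebasic
; Made-up sample data with one lower-case and one upper-case anchor
html.s = "x <a href=" + Chr(34) + "one.html" + Chr(34) + ">1</a>"
html + " y <A HREF=" + Chr(34) + "http://www.example.com/two.mov" + Chr(34) + ">2</A>"

upper.s = UCase(html)              ; converted once, used only for searching
p = FindString(upper, "<A HREF=" + Chr(34), 1)
While p
  p + 9                            ; skip the 9 characters of <A HREF="
  e = FindString(html, Chr(34), p) ; closing quote of the href value
  If e = 0 : Break : EndIf         ; no closing quote: stop
  url.s = Mid(html, p, e - p)
  Debug url
  p = FindString(upper, "<A HREF=" + Chr(34), e)
Wend
```

Whether this is noticeably faster depends on the size of the file; for a short page the difference is unlikely to matter.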

Post by bidanh00co »

Thanks a lot, CONVERT, Dare2 and ricardo!

I tested them all and they work very well. That's cool!

But what if I only want to extract links that end with a file extension such as *.MOV, *.AVI, *.WMA...? How can I modify this code?

Post by rsts »

bidanh00co wrote: How can I modify this code?

1st - Do you know anything about programming?

:)

Post by bidanh00co »

Hi rsts!

I'm just learning programming... and I'm really bad at it :P

Of course, my question is: "How can this code be modified so that it extracts only the links that end with a specific extension: MOV, AVI, WMV...?"

You know, I have many ideas, but sometimes I can't find the way, so I post here, even though I can't fully understand all the code you all provide.

Post by Dare2 »

:)

How about using Right() to check the last 4 characters of an extracted link to see if it is ".AVI"?

Post by CONVERT »

Add this code at the end of my extract procedure, just before ProcedureReturn:

Code: Select all

if right(res$,4) <> ".xxx"
  res$ = ""
endif

Post by CONVERT »

Or better:

Code: Select all

If Ucase(Right(res$,4)) <> ".XXX"
  res$ = ""
Endif
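
To accept several extensions at once (MOV, AVI, WMA...), the same Right()/UCase() test can be wrapped in a small helper. This is only a sketch: the procedure name WantedLink and the list of extensions are made up for illustration, and it assumes three-letter extensions with nothing (such as a query string) after the file name:

```purebasic
; Hypothetical helper: returns #True if url$ ends with one of the wanted extensions.
Procedure WantedLink(url$)
  ext$ = UCase(Right(url$, 4))     ; last 4 characters, e.g. ".AVI"
  If ext$ = ".MOV" Or ext$ = ".AVI" Or ext$ = ".WMA" Or ext$ = ".WMV"
    ProcedureReturn #True
  EndIf
  ProcedureReturn #False
EndProcedure

Debug WantedLink("http://www.example.com/clip.avi")    ; 1
Debug WantedLink("http://www.example.com/page.html")   ; 0
```

Inside the extract procedure the test would then read: If WantedLink(res$) = #False : res$ = "" : EndIf.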