Extract link from a HTML file
-
- User
- Posts: 56
- Joined: Thu Jul 07, 2005 10:06 am
I have an HTML file that contains many website links. Copying and pasting each link by hand seems like a foolish way to do it.
How could we write a program that extracts all the links from an HTML file? I know the links always begin with "http://www." and end with a double quote (").
-
- Enthusiast
- Posts: 731
- Joined: Wed Apr 21, 2004 7:12 pm
Code: Select all
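; Read the whole file (already opened as file 0) into one string
While Eof(0)=0
  String.s+ReadString()
Wend
; Print each piece of text that starts at "http://" and ends before the next double quote
Repeat
  p=FindString(String,"http://",1)
  If p
    q=FindString(String,Chr(34),p)
    If q=0 : q=Len(String)+1 : EndIf
    Debug Mid(String,p,q-p)
    String=Mid(String,q+1,Len(String))
  EndIf
Until p=0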
~I see one problem with your reasoning: the fact is thats not a chicken~
Some parts of these procedures may help you. They look for image file names in .html files.
For your case, wfin$ can contain Chr(34).
wpos can contain the position in the string just after "http://www".
In the second procedure below (Rech_Image), which calls the first one (extrait), you can replace "<img src=" + G$ + "_" with "http://www".
You can replace ".jpg" with Chr(34).
(G$ contains Chr(34).)
Code: Select all
Procedure.s extrait(enr$,wpos,wfin$)
  res$ = ""
  wposg = FindString(enr$,wfin$,wpos)
  If wposg = wpos
    wlib$ = wfile$ + Chr(13)
    wlib$ = wlib$ + " impossible de trouver " + wfin$ + " dans extrait()" + Chr(13)
    wlib$ = wlib$ + "dans : " + enr$ + Chr(13)
    wlib$ = wlib$ + "ERREUR FATALE."
    mess(wlib$)
    End
  EndIf
  res$ = Mid(enr$,wpos,wposg - wpos)
  ProcedureReturn res$
EndProcedure
Code: Select all
Procedure Rech_Image(wdir_current_orig$, wfile_htm$)
  wfile$ = wdir_current_orig$ + "\" + wfile_htm$
  wres = ReadFile(GN_htm,wfile$)
  If wres = 0
    Mess(wfile$ + " NON OUVERT dans Rech_Image(). Erreur fatale.")
    End
  EndIf
  While Eof(GN_htm) = 0
    enr$ = ReadString()
    wpos = FindString(enr$,"<img src=" + G$ + "_",1)
    If wpos <> 0
      wnom_numero$ = extrait(enr$,wpos + 10,".jpg") ; ---------------- IMAGES
      ; vignettes 00001_v.jpg non traités
      ; cas de contact1.html général
      If wnom_numero$ = ""
        wlib$ = wfile$ + Chr(13)
        wlib$ = wlib$ + " impossible d'extraire le nom 'numero' de l'image" + Chr(13)
        wlib$ = wlib$ + "dans : " + enr$ + Chr(13)
        wlib$ = wlib$ + "ERREUR FATALE."
        mess(wlib$)
        End
      Else
        wpos2 = FindString(enr$,"alt=" + G$, wpos + 14)
        If wpos2 <> 0
          wnom_definitif$ = extrait(enr$,wpos2 +5,G$)
          If wnom_definitif$ = ""
            wlib$ = wfile$ + Chr(13)
            wlib$ = wlib$ + "Pour '<img src=' : " + wnom_numero$ + " à partir de " + Str(wpos + 14) + Chr(13)
            wlib$ = wlib$ + " impossible d'extraire le nom 'alt=' de l'image" + Chr(13)
            wlib$ = wlib$ + "dans : " + enr$ + Chr(13)
            wlib$ = wlib$ + "ERREUR FATALE."
            mess(wlib$)
            End
          Else
            If GInum > GInum_maxi
              wlib$ = wfile$ + Chr(13)
              wlib$ = wlib$ + "Nombre d'images supérieur à " + Str(GInum_maxi) + Chr(13)
              wlib$ = wlib$ + "dans : " + enr$ + Chr(13)
              wlib$ = wlib$ + "ERREUR FATALE."
              mess(wlib$)
              End
            EndIf
            Tnum$(GInum) = wnom_numero$
            Tnom$(GInum) = wnom_definitif$
            GInum = GInum + 1
          EndIf
        EndIf
      EndIf
    EndIf
  Wend
  CloseFile(GN_htm)
  ProcedureReturn
EndProcedure
PureBasic 6.01 LTS 64 bit | Windows 10 Pro x64 | Intel(R) Core(TM) i7-8700 CPU @ 3.20Ghz 16 GB RAM, SSD 500 GB, PC locally assembled
-
- User
- Posts: 56
- Joined: Thu Jul 07, 2005 10:06 am
Thanks very much, CONVERT & KillSwitch!
I tried CONVERT's code first. It is a little difficult for me (your comments are not in English).
When I tested it, the compiler reported the error "mess() is not a function, an array, or linked list".
The error comes from the line "mess(wlib$)".
Sorry for my stupid question, but I really do not understand your code.
I can follow KillSwitch's idea, though.
-
- Addict
- Posts: 1648
- Joined: Mon Sep 20, 2004 3:52 pm
- Contact:
Try this code instead; it is closer to what you need.
The result is written to the file out.txt in the current directory.
Code: Select all
Procedure.s extract(enr$,wpos,wfin$)
  res$ = ""
  wposg = FindString(enr$,wfin$,wpos)
  If wposg <> wpos And wposg <> 0
    res$ = Mid(enr$,wpos,wposg - wpos)
  EndIf
  ProcedureReturn res$
EndProcedure

Procedure Look_url(wdir_current_orig$, wfile_htm$,wno_out)
  wfile$ = wdir_current_orig$ + "\" + wfile_htm$
  wno_in = ReadFile(#PB_Any,wfile$)
  If wno_in = 0
    MessageRequester("Error",wfile$ + " not opened.",0)
    End
  EndIf
  While Eof(wno_in) = 0
    enr$ = ReadString()
    wpos = FindString(enr$,"http://",1)
    If wpos <> 0
      wurl$ = extract(enr$,wpos + 7,Chr(34))
      If wurl$ <> ""
        UseFile(wno_out)
        WriteStringN(wurl$)
        UseFile(wno_in)
      EndIf
    EndIf
  Wend
  CloseFile(wno_in)
  ProcedureReturn
EndProcedure

;- BEGIN
infile$ = "your file.html"
current_dir$ = "your directory"
wno_out = CreateFile(#PB_Any,"out.txt")
If wno_out = 0
  MessageRequester("Error","out.txt not created.",0)
  End
EndIf
Look_url(current_dir$, infile$,wno_out)
CloseFile(wno_out)
End
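For anyone trying this on a recent PureBasic version: as far as I know, UseFile() is no longer available there, and the file commands take the file number as their first argument. Here is a minimal sketch of the same idea under that assumption. ExtractUrl() and sample.html are hypothetical names used only for illustration. Unlike the code above, it keeps the "http://" prefix in the result, and like the code above it only picks up the first link on each line.

```purebasic
; Hypothetical helper: returns the first http:// link in line$, or "" if there is none.
Procedure.s ExtractUrl(line$)
  p = FindString(line$, "http://", 1)
  If p
    e = FindString(line$, Chr(34), p)   ; the closing double quote ends the link
    If e > p
      ProcedureReturn Mid(line$, p, e - p)
    EndIf
  EndIf
  ProcedureReturn ""
EndProcedure

; Write a one-line sample file so the sketch can be run on its own.
file$ = GetTemporaryDirectory() + "sample.html"
hOut = CreateFile(#PB_Any, file$)
If hOut
  WriteStringN(hOut, "<a href=" + Chr(34) + "http://www.example.com/a.mov" + Chr(34) + ">A</a>")
  CloseFile(hOut)
EndIf

; Read the file back line by line and show every link found.
hIn = ReadFile(#PB_Any, file$)
If hIn
  While Eof(hIn) = 0
    url$ = ExtractUrl(ReadString(hIn))
    If url$ <> "" : Debug url$ : EndIf
  Wend
  CloseFile(hIn)
EndIf
```

To write the links to a file instead of the debug window, open a second file with CreateFile() and call WriteStringN() with its number in place of Debug.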
Assuming you are looking at an HTML source:
Code: Select all
html.s="Say this is where all your links are embedded."
html+" For example <a href="+Chr(34)+"apage.html"+Chr(34)+">Click text</a>"
html+" and sans quotes <A HREF=http://www.google.com>GOOGLE</A>"
html+" and perhaps <a href="+Chr(34)+"javascript:doWeReallyWantThis(var);"+Chr(34)+">JS Call</a>"
Debug html
; This loop pulls all hyperlinks. Handle result in "found" as needed.
p=1
Repeat
  p=FindString(UCase(html),"<A",p)
  If p
    p=FindString(html,"=",p)+1
    e=FindString(html,">",p)
    found.s=Trim(ReplaceString(Mid(html,p,e-p),Chr(34),""))
    Debug found
    p=e
  EndIf
Until p=0
@}--`--,-- A rose by any other name ..
Dare2 wrote:
Code: Select all
Repeat
  p=FindString(UCase(html),"<A",p)
  If p
    p=FindString(html,"=",p)+1
    e=FindString(html,">",p)
    found.s=Trim(ReplaceString(Mid(html,p,e-p),Chr(34),""))
    Debug found
    p=e
  EndIf
Until p=0
I think this is faster, right?
Code: Select all
Repeat
  p=FindString(UCase(html),"<A HREF=",p)
  If p
    p+9
    e=FindString(html,Chr(34),p+1)
    found.s=Mid(html,p,e-p)
    Debug found
    p=e
  EndIf
Until p=0
ARGENTINA WORLD CHAMPION
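On the speed question: both loops in this thread call UCase(html) on every pass, so the whole page is uppercased again for each link. One possible speed-up, sketched below, is to make the uppercase copy once and search in that copy while cutting the link out of the original text. The variable name upper and the sample HTML are made up for this example, and the sketch assumes every href value is wrapped in double quotes.

```purebasic
; Sample input with two quoted links (made up for this example).
html.s = "<a href=" + Chr(34) + "one.html" + Chr(34) + ">1</a> "
html + "<A HREF=" + Chr(34) + "two.html" + Chr(34) + ">2</A>"

upper.s = UCase(html)   ; uppercase copy made once, used only for searching
p = 1
Repeat
  p = FindString(upper, "<A HREF=" + Chr(34), p)
  If p
    p + 9                              ; skip the 9 characters of <A HREF="
    e = FindString(html, Chr(34), p)   ; closing double quote
    If e
      Debug Mid(html, p, e - p)        ; link text taken from the original string
    EndIf
    p = e                              ; continue after this link (0 stops the loop)
  EndIf
Until p = 0
```

With the sample string this shows one.html and then two.html in the debug window. Unquoted links such as HREF=http://www.google.com are skipped by this version.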
-
- User
- Posts: 56
- Joined: Thu Jul 07, 2005 10:06 am
Hi rsts!
I'm just learning programming, and I'm really bad at it.
Of course, my question is: "How can this code be modified so that it extracts only the links that end with particular extensions: MOV, AVI, WMV...?"
You know, I have many ideas, but sometimes I cannot find the way, so I post them here, even though I can't fully understand all the code you all provide.
Add this code at the end of my extract procedure, just before ProcedureReturn, replacing ".xxx" with the extension you want:
Code: Select all
If Right(res$,4) <> ".xxx"
  res$ = ""
EndIf
Or better, so that the comparison ignores upper and lower case:
Code: Select all
If UCase(Right(res$,4)) <> ".XXX"
  res$ = ""
EndIf
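For the three extensions asked about (MOV, AVI, WMV), the same test can be repeated with Or. The sketch below is only one way to do it; HasWantedExtension() is a hypothetical helper name, and the URLs are made-up examples. It assumes every wanted extension is a dot followed by three letters.

```purebasic
; Hypothetical helper: returns 1 when url$ ends with .mov, .avi or .wmv (any case), else 0.
Procedure.l HasWantedExtension(url$)
  ext$ = UCase(Right(url$, 4))   ; last four characters, uppercased for the comparison
  If ext$ = ".MOV" Or ext$ = ".AVI" Or ext$ = ".WMV"
    ProcedureReturn 1
  EndIf
  ProcedureReturn 0
EndProcedure

Debug HasWantedExtension("http://www.example.com/clip.mov")    ; shows 1
Debug HasWantedExtension("http://www.example.com/page.html")   ; shows 0
```

Inside the extract procedure, the check would then read: If HasWantedExtension(res$) = 0, set res$ to "".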