Holy shit, it's hard.
Anyone have any tips?
The furthest I've gotten is parsing the HTML for all of the input fields' names and values, but how can I grab them one by one?
My current code:
Thanks
Code:
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

class JSoupTest {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://gaiaonline.com/auth").get();
        // A plain tag selector matches every <input> element;
        // "sid|input" would only match inputs in an XML namespace called "sid".
        Elements inputs = doc.select("input");
        for (Element input : inputs) {
            System.out.println(input.attr("name"));
            System.out.println(input.attr("value"));
        }
    }
}
- 12 Dec. 2012 01:26pm #1
- Join Date
- Apr. 2010
- Location
- When freedom is outlawed only outlaws will be free
- Posts
- 5,113
- Reputation
- 195
- LCash
- 1442.00
Grabbing values in HTML using a programming language is harder than I imagined
- 12 Dec. 2012 03:58pm #2
Use RegEx? <input...name="etc"...value="etc"(?: \/)?>
And then run it again with value and name reversed, if you have to. value="etc"...name="etc" but I doubt they use that markup.
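A rough sketch of the regex idea in Java, using only java.util.regex. Instead of running one pattern twice with name and value swapped, it matches each input tag first and then pulls the two attributes out separately, so their order doesn't matter. The class name, method name, and sample HTML are made up for illustration; it assumes double-quoted attributes with no ">" inside a value, and it can be fooled by attributes like data-name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library mentioned in this thread.
class InputRegexSketch {
    // One <input ...> tag, up to the first '>'.
    private static final Pattern INPUT_TAG =
            Pattern.compile("<input\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Double-quoted name="..." and value="..." attributes.
    private static final Pattern NAME_ATTR =
            Pattern.compile("\\bname\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern VALUE_ATTR =
            Pattern.compile("\\bvalue\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns name -> value for every named input, in page order.
    static Map<String, String> extractInputs(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher tag = INPUT_TAG.matcher(html);
        while (tag.find()) {
            Matcher name = NAME_ATTR.matcher(tag.group());
            if (!name.find()) {
                continue; // skip inputs that have no name attribute
            }
            Matcher value = VALUE_ATTR.matcher(tag.group());
            fields.put(name.group(1), value.find() ? value.group(1) : "");
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up markup; the second tag has value before name.
        String html = "<form><input type=\"hidden\" name=\"token\" value=\"abc123\" />"
                + "<input value=\"42\" type=\"hidden\" name=\"sid\"></form>";
        System.out.println(extractInputs(html)); // {token=abc123, sid=42}
    }
}
```

Once the fields are in a map, grabbing them one by one is just extractInputs(html).get("sid").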
- 12 Dec. 2012 04:29pm #3
I bring two gifts! Possibly more.
These are both for Java. Have fun.
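Code:
import java.util.ArrayList;
import java.util.List;

// Returns the text between the first strStart and the next strEnd,
// or an empty string if either marker is missing.
public String Between(String strString, String strStart, String strEnd)
{
    int intStart = strString.indexOf(strStart);
    if (intStart < 0) return "";
    int intBegin = intStart + strStart.length();
    int intEnd = strString.indexOf(strEnd, intBegin);
    if (intEnd < 0) return "";
    return strString.substring(intBegin, intEnd);
}

// Returns every piece of text found between a Start marker and the End marker that follows it.
public List<String> GetAll(String Input, String Start, String End){
    List<String> Values = new ArrayList<String>();
    int Offset = 0;
    while(true){
        if(Input.length() > 0 && Start.length() > 0 && End.length() > 0 && Offset < Input.length()){
            int StartPos = (Input.indexOf(Start, Offset) + Start.length());
            if((StartPos - Start.length()) > -1 && Input.length() >= StartPos){
                int Length = (Input.indexOf(End, StartPos) - StartPos);
                if(Length > -1){
                    // Java's substring takes (beginIndex, endIndex), not (beginIndex, length).
                    Values.add(Input.substring(StartPos, StartPos + Length));
                    Offset = StartPos + Length;
                    continue;
                }
            }
        }
        break;
    }
    return Values;
}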
http://forum.logicalgamers.com/sourc...pwrappers.html
- 12 Dec. 2012 05:43pm #4
Just use what Chad suggested. It's what we used back in the day at least. The fabled string between methods.
Using regex for parsing HTML is bad practice anyway.
- 12 Dec. 2012 05:51pm #5
Also, jSoup? Is it using browser DOM to load the page? Or is it extracting the DOM from the HTTP medium?
Is that some sort of parsing library? If it works the way I think it does you don't need to parse anything. All the values you need are present in said DOM and all you need to do is access them. Or you could just send user and password and the JavaScript will handle the rest being that the required values are in the page.
I may look into this. I could be wrong depending on what kind of library that is.
- 12 Dec. 2012 06:35pm #6
It's a library for fetching and parsing HTML pages. It's not too bad; it's fairly powerful if you're looking to get statistics on the links on a page, or if you want to find out how many hidden input values are on a page, or something like that. It's interesting, just not as helpful as I thought it would be.
@Chad: Thanks a ton, I'll probably be using those from now on.
EDIT: Well, I obviously suck at Java when I can't even figure out how to use the HTTP wrappers. Give me a bit, I'm half retarded.
Last edited by 323; 12 Dec. 2012 at 06:41pm.
- 13 Dec. 2012 06:41am #7
- 13 Dec. 2012 07:14am #8
Yes.
No.
I read this article a while back.
If there's a simpler way to get the necessary values, use it. Much more goes on under the hood with regular expressions than with a simple string search.
- 13 Dec. 2012 03:43pm #9
Anyway, anyone want to teach me how to use those HTTP wrappers? I think I get it, but don't want to have to do three hours of guessing and checking when someone can just tell me.
- 13 Dec. 2012 03:55pm #10
I've already given you pretty concrete HTTP wrappers. It isn't necessary to use another set.
Just include or import the library, instantiate the class. You're done.
Edit: I posted an example before. https://github.com/Isonyx/HTTPReques...rc/Tester.java
Just add the functions Chad posted to the wrapper or to your main Java file. Or somewhere else where the methods can be accessed.
Last edited by The Unintelligible; 13 Dec. 2012 at 04:02pm.
- 13 Dec. 2012 05:23pm #11
I was already using the HTTP wrappers that you had given me before, the ones by Isonyx. How would I add those functions, though? Reading Chad's code, I barely even know what they do :/
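For anyone else stuck at the same point, here is one way the two string-between helpers could sit in a class of their own and be called. It is a condensed rewrite, not Chad's exact code: the class name ParseHelpers, the lower-case method names, and the sample HTML are all invented for the example, and the page source is a hard-coded string standing in for whatever your HTTP wrapper returns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical utility class; put it next to your main class and call the static methods.
class ParseHelpers {
    // Text between the first occurrence of start and the next occurrence of end; "" if not found.
    static String between(String text, String start, String end) {
        int from = text.indexOf(start);
        if (from < 0) {
            return "";
        }
        from += start.length();
        int to = text.indexOf(end, from);
        return to < 0 ? "" : text.substring(from, to);
    }

    // Every piece of text that sits between a start marker and the end marker that follows it.
    static List<String> getAll(String text, String start, String end) {
        List<String> values = new ArrayList<>();
        if (start.isEmpty() || end.isEmpty()) {
            return values; // empty markers would never advance the search
        }
        int offset = 0;
        while (true) {
            int from = text.indexOf(start, offset);
            if (from < 0) {
                break;
            }
            from += start.length();
            int to = text.indexOf(end, from);
            if (to < 0) {
                break;
            }
            values.add(text.substring(from, to));
            offset = to + end.length(); // continue after the end marker
        }
        return values;
    }

    public static void main(String[] args) {
        // Stand-in for the page source an HTTP wrapper would hand back.
        String html = "<input name=\"sid\" value=\"abc123\"><input name=\"token\" value=\"xyz\">";
        System.out.println(between(html, "name=\"sid\" value=\"", "\"")); // abc123
        System.out.println(getAll(html, "value=\"", "\""));               // [abc123, xyz]
    }
}
```

In short: between gives you one value when you know the exact text on either side of it, and getAll gives you every value that sits between the same pair of markers.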
- 13 Dec. 2012 05:55pm #12
- 13 Dec. 2012 07:23pm #13
It depends on what you are using it for. Not all input fields can be found with a string search, such as if you don't know the order of the parameters, or if the parameters change each page load. RegExp can do more than look for all the contents between two literal strings, and oftentimes that is either necessary or decreases the time it takes to code or find. It's much more extensible, and especially given how this is being used to gather data from a website that is updated by a third party, extensibility is much more important than the microseconds saved by using a string search.
- 13 Dec. 2012 07:48pm #14
Like I said, if there's a simpler way to parse text, use it. Your main point is contingent on the scenario of the user. Of course regular expressions are more apt for certain tasks, mainly being actual pattern matching (its original and primary purpose. e.g. e-mail validation is a task I'd use regex or some form of pattern matching for) and not "pure" string parsing (e.g. getting certain values in a body of text subject to change like HTML).
Generally speaking, avoid using RegExp whenever possible. As a rule of thumb I try to make RegExp a last resort kind of thing, not something I reach for simply because it's more convenient for the task at hand.
Edit: Everything you've said has also been addressed in the blog post I hotlinked. I'm assuming you haven't read it. HTML parsers are always the better choice for parsing HTML or markup. RegExp in some cases is better than a simple string search (depending on the complexity of the task) but never is it the better choice for parsing HTML.
Actually, to put it simply, as you said it depends on what you're using it for. If I'm going to parse simple HTML I'm going to use some other means of string parsing like string searching. If I'm going to parse something where I'm looking for more than string literals I'm going to use regex or an HTML parser (most likely the latter).
RegExp has poor readability and thus poor manageability. It has a steeper learning curve. It's also somewhat slower in execution. It's a hazard and pitfall you should often avoid.
Though it is probably a lot more extensible than string searching.
Last edited by The Unintelligible; 13 Dec. 2012 at 08:55pm.
- 14 Dec. 2012 08:26am #15
I'm going to not read the marked out part. If there's anything insulting in it, someone tell me, and I'll rip Untinkerbell a new one.
I'm pretty sure RegExp is much faster than an HTML parser, at least for things as simple as finding values for attributes. RegExp doesn't get slow until you do shit like lookbehinds or what-not. A basic value="(\d+)" or what have you will go way faster than parsing the DOM, but a parser would be way safer and easier to code, and would be more useful if you were scanning multiple values.
Just based on the OP, getting a single value would be easier/faster/more-extensible with RegExp than a string search; and I'd say a DOM parser would be great, but that's assuming Gaia uses valid markup, and I highly doubt they do. The only people you can really trust to use an HTML parser on would be something you wrote yourself.
- 14 Dec. 2012 04:36pm #16
- 14 Dec. 2012 05:28pm #17
Lol, you've been pretty paranoid lately. Just read the marked out part regardless of whether it's insulting or not. It provides more insight to my points. Though just as an FYI, it isn't insulting.
If you choose not to it doesn't really matter though, it's not my main point or anything. I wish I said something insulting in it though in hindsight, I wanted to be ripped a new one. This old one is getting, old. ):
Bottom line is I would not consider RegExp as my first option for parsing HTML. To each his own though. Aside from that the Regex you gave OP wouldn't even work for their current login format unless I'm mistaken.
In conclusion, I'd typically use an HTML parser or something of the like to parse HTML. RegExp as a last resort.
- 14 Dec. 2012 05:30pm #18
- 14 Dec. 2012 06:23pm #19
It was like 4am or some shit. I didn't have time to read it. Nor now.
Bottom line is I would not consider RegExp as my first option for parsing HTML.
e.g.
name="whatiwant" value="this value"
becoming
name="whatiwant" size="42" value="this value"
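A small sketch of that failure case, assuming the two snippets above sit inside an input tag. The literal search looks for the exact text name="whatiwant" value=" and stops matching once size="42" is inserted, while a pattern that allows other attributes in between still finds the value. Class and method names are made up, and the pattern still assumes name comes before value within the same tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing a literal search with a tolerant pattern.
class AttributeDriftSketch {
    // Literal search: only works while value directly follows name.
    static String literalSearch(String html) {
        String start = "name=\"whatiwant\" value=\"";
        int from = html.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();
        int to = html.indexOf('"', from);
        return to < 0 ? null : html.substring(from, to);
    }

    // Pattern: allows any other attributes between name and value inside the same tag.
    private static final Pattern TOLERANT =
            Pattern.compile("name=\"whatiwant\"[^>]*?\\bvalue=\"([^\"]*)\"");

    static String patternSearch(String html) {
        Matcher m = TOLERANT.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String before = "<input name=\"whatiwant\" value=\"this value\">";
        String after = "<input name=\"whatiwant\" size=\"42\" value=\"this value\">";
        System.out.println(literalSearch(before)); // this value
        System.out.println(literalSearch(after));  // null
        System.out.println(patternSearch(before)); // this value
        System.out.println(patternSearch(after));  // this value
    }
}
```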
- 15 Dec. 2012 05:53pm #20