Grabbing values in HTML using a programming language is harder than I imagined

**323** · 12 Dec. 2012 01:26pm

Holyshit it's hard.

Anyone have any tips?

The furthest I've gotten is parsing the HTML for all of the input fields' names and values, but how can I grab them one by one?

My current code:

Code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

class JSoupTest {
    public static void main(String[] args) throws IOException {
        // Fetch the page and parse it into a DOM.
        Document doc = Jsoup.connect("http://gaiaonline.com/auth").get();
        // The CSS selector "input" matches every <input> element on the page;
        // an attribute selector such as input[name=...] narrows it to one field.
        Elements inputs = doc.select("input");
        for (Element input : inputs) {
            System.out.println(input.attr("name"));
            System.out.println(input.attr("value"));
        }
    }
}

Thanks

**GAMEchief** · 12 Dec. 2012 03:58pm

Use RegEx? <input...name="etc"...value="etc"(?: \/)?>
Then run it again with value and name reversed if you have to (value="etc"...name="etc"), but I doubt they use that markup.
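
A minimal sketch of that regex idea in Java, assuming double-quoted attributes and an unknown attribute order. Rather than one pattern per ordering, it matches each input tag first and then pulls the attributes out of that tag. The class name InputRegexSketch, the method inputValues, and the sample field names are hypothetical, and real-world markup (single quotes, unquoted values, a > inside an attribute) would need more care than this.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InputRegexSketch {
    // Matches one whole <input ...> tag, case-insensitively.
    private static final Pattern INPUT_TAG =
            Pattern.compile("<input\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Matches a double-quoted attribute such as name="token".
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([\\w-]+)\\s*=\\s*\"([^\"]*)\"");

    // Returns name -> value for every <input> that has a name attribute.
    static Map<String, String> inputValues(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher tag = INPUT_TAG.matcher(html);
        while (tag.find()) {
            String name = null;
            String value = "";
            Matcher attr = ATTRIBUTE.matcher(tag.group());
            while (attr.find()) {
                if (attr.group(1).equalsIgnoreCase("name")) {
                    name = attr.group(2);
                } else if (attr.group(1).equalsIgnoreCase("value")) {
                    value = attr.group(2);
                }
            }
            if (name != null) {
                result.put(name, value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical markup; the field names are made up for illustration.
        String html = "<form>"
                + "<input type=\"hidden\" name=\"token\" value=\"abc123\" />"
                + "<input value=\"42\" type=\"hidden\" name=\"answer\">"
                + "</form>";
        Map<String, String> inputs = inputValues(html);
        System.out.println(inputs.get("token"));  // abc123
        System.out.println(inputs.get("answer")); // 42
    }
}
```

Collecting the pairs into a map is one way to grab the fields "one by one": look each one up by its name once the page has been scanned.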

**Chad** · 12 Dec. 2012 04:29pm

I bring two gifts! Possibly more.
These are both for Java. Have fun.

Code:

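import java.util.ArrayList;
import java.util.List;

// Returns the text between the first strStart and the next strEnd,
// or an empty string if either delimiter is missing.
public String Between(String strString, String strStart, String strEnd)
{
    int intStart = strString.indexOf(strStart);
    if (intStart < 0) return "";
    int intBegin = intStart + strStart.length();
    int intEnd = strString.indexOf(strEnd, intBegin);
    if (intEnd < 0) return "";
    return strString.substring(intBegin, intEnd);
}

// Returns every substring of Input found between a Start and the next End.
public List<String> GetAll(String Input, String Start, String End){
    List<String> Values = new ArrayList<String>();
    int Offset = 0;

    while(true){
       if(Input.length() > 0 && Start.length() > 0 && End.length() > 0 && Offset < Input.length()){
            int StartPos = (Input.indexOf(Start, Offset) + Start.length());
            if((StartPos - Start.length()) > -1 && Input.length() >= StartPos){
                int Length = (Input.indexOf(End, StartPos) - StartPos);
                if(Length > -1){
                    // Java's substring takes (beginIndex, endIndex), not (start, length).
                    Values.add(Input.substring(StartPos, StartPos + Length));
                    Offset = StartPos + Length;
                    continue;
                }
            }
       }
       break;
    }
    return Values;
}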

Edit: I posted these way back in the day, I suggest you look into it.

http://forum.logicalgamers.com/sourc...pwrappers.html

**The Unintelligible** · 12 Dec. 2012 05:43pm

Just use what Chad suggested. It's what we used back in the day at least. The fabled string between methods.

Using regex for parsing HTML is bad practice anyway.

**The Unintelligible** · 12 Dec. 2012 05:51pm

Also, jSoup? Is it using browser DOM to load the page? Or is it extracting the DOM from the HTTP medium?

Is that some sort of parsing library? If it works the way I think it does you don't need to parse anything. All the values you need are present in said DOM and all you need to do is access them. Or you could just send user and password and the JavaScript will handle the rest being that the required values are in the page.

I may look into this. I could be wrong depending on what kind of library that is.

**323** · 12 Dec. 2012 06:35pm

Originally Posted by The Unintelligible

Also, jSoup? Is it using browser DOM to load the page? Or is it extracting the DOM from the HTTP medium?

Is that some sort of parsing library? If it works the way I think it does you don't need to parse anything. All the values you need are present in said DOM and all you need to do is access them. Or you could just send user and password and the JavaScript will handle the rest being that the required values are in the page.

I may look into this. I could be wrong depending on what kind of library that is.

It's a library for fetching and parsing HTML pages. It's not too bad, and it's fairly powerful if you're looking to get statistics on the links on a page, or if you want to find out how many hidden input values are on a page, or something like that. It's interesting, just not as helpful as I thought it would be.

@Chad: Thanks a ton, I'll probably be using those from now on.

EDIT: Well, I obviously suck at Java when I can't even figure out how to use the HTTP wrappers. Give me a bit, I'm half retarded.

**GAMEchief** · 13 Dec. 2012 06:41am

Originally Posted by The Unintelligible

Just use what Chad suggested. It's what we used back in the day at least. The fabled string between methods.

Using regex for parsing HTML is bad practice anyway.

Nothing is good for parsing HTML besides HTML/XML parsers.

Also RegExp is way better than doing a string search.

**The Unintelligible** · 13 Dec. 2012 07:14am

Originally Posted by GAMEchief

Nothing is good for parsing HTML besides HTML/XML parsers.

Yes.

Originally Posted by GAMEchief

Also RegExp is way better than doing a string search.

No.

I read this article a while back.

If there's a simpler way to get the necessary values, use it. Much more goes on under the hood with regular expressions than with a simple string search.

**323** · 13 Dec. 2012 03:43pm

Anyway, anyone want to teach me how to use those HTTP wrappers? I think I get it, but don't want to have to do three hours of guessing and checking when someone can just tell me.

**The Unintelligible** · 13 Dec. 2012 03:55pm

Originally Posted by Flareboy323

Anyway, anyone want to teach me how to use those HTTP wrappers? I think I get it, but don't want to have to do three hours of guessing and checking when someone can just tell me.

I've already given you pretty concrete HTTP wrappers. It isn't necessary to use another set.

Just include or import the library, instantiate the class. You're done.

Edit: I posted an example before. https://github.com/Isonyx/HTTPReques...rc/Tester.java

Just add the functions Chad posted to the wrapper or to your main Java file. Or somewhere else where the methods can be accessed.

**323** · 13 Dec. 2012 05:23pm

I was already using those HTTP wrappers that you had given me before, the ones by Isonyx. How would I add those functions though? Reading Chad's code, I barely even know what they do :/

**The Unintelligible** · 13 Dec. 2012 05:55pm

Originally Posted by Flareboy323

I was already using those HTTP wrappers that you had given me before, the ones by Isonyx. How would I add those functions though? Reading Chad's code, I barely even know what they do :/

Chad's code or whatever sucks. If you don't understand the code it means that it's bad code. Just use the other wrappers.

Copy and paste the snippets he provided and use them in the context of your project.
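
A rough sketch of what "use them in the context of your project" could look like: the two string-between helpers rewritten as static methods, plus a main that calls them. This is an illustrative rewrite under assumptions, not the original wrapper code; the class name StringBetweenSketch, the lower-case method names, and the sample HTML are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class StringBetweenSketch {
    // Text between the first `start` and the next `end`, or null if either is missing.
    static String between(String text, String start, String end) {
        int from = text.indexOf(start);
        if (from < 0) return null;
        from += start.length();
        int to = text.indexOf(end, from);
        return to < 0 ? null : text.substring(from, to);
    }

    // Every piece of text that sits between a `start` and the following `end`.
    static List<String> getAll(String text, String start, String end) {
        List<String> values = new ArrayList<>();
        if (start.isEmpty() || end.isEmpty()) return values;
        int offset = 0;
        while (true) {
            int from = text.indexOf(start, offset);
            if (from < 0) break;
            from += start.length();
            int to = text.indexOf(end, from);
            if (to < 0) break;
            values.add(text.substring(from, to));
            offset = to + end.length(); // resume after the closing delimiter
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical page source; the field names are made up.
        String html = "<input name=\"user\" value=\"alice\">"
                + "<input name=\"token\" value=\"abc123\">";
        // One specific field: everything after name="token" value=" up to the next quote.
        System.out.println(between(html, "name=\"token\" value=\"", "\"")); // abc123
        // Every value attribute on the page.
        System.out.println(getAll(html, "value=\"", "\"")); // [alice, abc123]
    }
}
```

In a real project the string passed in would be the response body returned by whichever HTTP wrapper is in use, and the delimiters would be copied from the page source.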

**GAMEchief** · 13 Dec. 2012 07:23pm

Originally Posted by The Unintelligible

If there's a simpler way to get the necessary values, use it. Much more goes on under the hood with regular expressions than with a simple string search.

It depends on what you are using it for. Not all input fields can be found with a string search, such as if you don't know the order of the parameters, or if the parameters change each page load. RegExp can do more than look for all the contents between two literal strings, and oftentimes that is either necessary or decreases the time it takes to code or find. It's much more extensible, and especially given how this is being used to gather data from a website that is updated by a third party, extensibility is much more important than the microseconds saved by using a string search.

**The Unintelligible** · 13 Dec. 2012 07:48pm

Originally Posted by GAMEchief

It depends on what you are using it for. Not all input fields can be found with a string search, such as if you don't know the order of the parameters, or if the parameters change each page load. RegExp can do more than look for all the contents between two literal strings, and oftentimes that is either necessary or decreases the time it takes to code or find. It's much more extensible, and especially given how this is being used to gather data from a website that is updated by a third party, extensibility is much more important than the microseconds saved by using a string search.

Like I said, if there's a simpler way to parse text, use it. Your main point is contingent on the scenario of the user. Of course regular expressions are more apt for certain tasks, mainly being actual pattern matching (its original and primary purpose. e.g. e-mail validation is a task I'd use regex or some form of pattern matching for) and not "pure" string parsing (e.g. getting certain values in a body of text subject to change like HTML).

Generally speaking, avoid using RegExp whenever possible. As a rule of thumb I try to make RegExp a last resort, not something to reach for simply because it's more convenient for the task at hand.

Edit: Everything you've said has also been addressed in the blog post I hotlinked. I'm assuming you haven't read it. HTML parsers are always the better choice for parsing HTML or markup. RegExp in some cases is better than a simple string search (depending on the complexity of the task) but never is it the better choice for parsing HTML.

Actually, to put it simply, as you said it depends on what you're using it for. If I'm going to parse simple HTML I'm going to use some other means of string parsing like string searching. If I'm going to parse something where I'm looking for more than string literals I'm going to use regex or an HTML parser (most likely the latter).

RegExp has poor readability and thus poor manageability. It has a steeper learning curve. It's also somewhat slower in execution. It's a hazard and pitfall you should often avoid.

Though it is probably a lot more extensible than string searching.

**GAMEchief** · 14 Dec. 2012 08:26am

I'm going to not read the marked out part. If there's anything insulting in it, someone tell me, and I'll rip Untinkerbell a new one.

I'm pretty sure RegExp is much faster than an HTML parser, at least for things as simple as finding values for attributes. RegExp doesn't get slow until you do shit like lookbehinds or what-not. A basic value="(\d+)" or what have you will go way faster than parsing the DOM, but a parser would be way safer and easier to code, so it would be more useful if you were scanning multiple values.

Just based on the OP, getting a single value would be easier/faster/more-extensible with RegExp than a string search; and I'd say a DOM parser would be great, but that's assuming Gaia uses valid markup, and I highly doubt they do. The only markup you can really trust an HTML parser on is markup you wrote yourself.

**Chad** · 14 Dec. 2012 04:36pm

I love how you say that it's my code when originally it was Alex's.

**The Unintelligible** · 14 Dec. 2012 05:28pm

Originally Posted by GAMEchief

I'm going to not read the marked out part. If there's anything insulting in it, someone tell me, and I'll rip Untinkerbell a new one.

Lol, you've been pretty paranoid lately. Just read the marked out part regardless of whether it's insulting or not. It provides more insight to my points. Though just as an FYI, it isn't insulting.

If you choose not to it doesn't really matter though, it's not my main point or anything. I wish I said something insulting in it though in hindsight, I wanted to be ripped a new one. This old one is getting, old. ):

Bottom line is I would not consider RegExp as my first option for parsing HTML. To each his own though. Aside from that the Regex you gave OP wouldn't even work for their current login format unless I'm mistaken.

In conclusion, I'd typically use an HTML parser or something of the like to parse HTML. RegExp as a last resort.

**The Unintelligible** · 14 Dec. 2012 05:30pm

Originally Posted by Chad

I love how you say that it's my code when originally it was Alex's.

I'm just saying man. I thought it was your code. It looks pretty poorly written. No offense intended.

**GAMEchief** · 14 Dec. 2012 06:23pm

Originally Posted by The Unintelligible

Lol, you've been pretty paranoid lately. Just read the marked out part regardless of whether it's insulting or not. It provides more insight to my points. Though just as an FYI, it isn't insulting.

It was like 4am or some shit. I didn't have time to read it. Nor now.

Bottom line is I would not consider RegExp as my first option for parsing HTML.

It's all circumstantial. In the way it was described in OP, RegExp is the first option, imo. When it comes to handling entire documents, especially document manipulation, parsers are where it's at. If it's looking up a well-defined, standardized single string, maybe a string search. But I don't trust Gaia with using well-defined, standardized strings (especially on entire pages instead of a snippet), and I definitely don't trust them with valid markup. RegExp seemed more fitting because it's more extensible: if Gaia changed the markup, a string search would break but a RegExp search would not.

e.g.
name="whatiwant" value="this value"
becoming
name="whatiwant" size="42" value="this value"
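
The markup change above can be tried out with a small sketch that runs a literal string search and a regex over both versions. The attribute strings come from the example; the class name MarkupChangeSketch, the method names, and the surrounding input tag are assumptions added for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkupChangeSketch {
    // Literal search: expects value="..." to follow name="whatiwant" directly.
    static String byStringSearch(String html) {
        String start = "name=\"whatiwant\" value=\"";
        int from = html.indexOf(start);
        if (from < 0) return null;
        from += start.length();
        int to = html.indexOf('"', from);
        return to < 0 ? null : html.substring(from, to);
    }

    // Regex: tolerates other attributes between name and value inside the same tag.
    private static final Pattern VALUE_AFTER_NAME =
            Pattern.compile("name=\"whatiwant\"[^>]*?\\bvalue=\"([^\"]*)\"");

    static String byRegex(String html) {
        Matcher m = VALUE_AFTER_NAME.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String before = "<input name=\"whatiwant\" value=\"this value\">";
        String after = "<input name=\"whatiwant\" size=\"42\" value=\"this value\">";
        System.out.println(byStringSearch(before)); // this value
        System.out.println(byStringSearch(after));  // null
        System.out.println(byRegex(before));        // this value
        System.out.println(byRegex(after));         // this value
    }
}
```

This particular pattern still assumes name comes before value and that both are double-quoted, so it only covers the change shown here; a reordered tag would break it as well.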

**Chad** · 15 Dec. 2012 05:53pm

Originally Posted by The Unintelligible

I'm just saying man. I thought it was your code. It looks pretty poorly written. No offense intended.

It's Java. Who the fuck likes Java around here?
